Vai al contenuto| Home page|

   Ti trovi in: HOME »Programmi, progetti e risultati »I progetti »PRIN - Programmi di ricerca di Rilevante Interesse Nazionale»Programma di ricerca
INIZIO_TESTO_DA_INDICIZZARE

RESEARCH PROGRAM

italiano - inglese
Similar research programs:
Scientific and education field classification
International Patent Classification
Geographical classification
Bibliografia
[Ba96]Introduction to Discrete Event Simulation. In proceeding of the
2th DYCOMANS workshop on " Management and Control : Tools in
Action", pp. 367-376 Algarve, Portugal.1996.

[BYSJ02] R. Baeza-Yates, F. Saint-Jean and C. Castillo. Web
structure, dynamics and page quality. In Proceedings of String
Processing and Information Retrieval (SPIRE), volume 2476 of Lecture
Notes in Computer Science, Lisbon, Portugal. Springer. 2002.

[CMRZ04] C. Castillo, M. Marin, A. Rodr?guez, and R. Baeza-Yates.
Scheduling algorithms for Web crawling. In Latin American Web
Conference (WebMedia/LA-WEB), pp 10–17, Riberao Preto, Brazil,
October 2004. IEEE CS Press.

[HH00] M. Henzinger, A. Heydon, M. Mitzenmacher and M. Najork. On
near–uniform url sampling. In Proceedings of the Ninth Conference on
World Wide Web, pages 295–308, Amsterdam, Netherlands. Elsevier
Science. 2000.

[BP98] L. Page,S. Brin Sergey, R. Motwani and T. Winograd. The PageRank citation ranking:
bringing order to the Web. Stanford Digital Library Technologies Project, 1998. http://citeseer.ist.psu.edu/page98pagerank.html

[GM05] Z. Gyongyi and H. Garcia-Molina. Web Spam Taxonomy. First International Workshop on Adversarial Information Retrieval on the Web, 2005. http://dbpubs.stanford.edu:8090/pub/2004-25

[FM04] D. Fetterly, M. Manasse and M. Najork. Statistical analysis of spam through detection of outliers in histograms. Proceedings of the seventh workshop on the Web and databases (WebDB), 2004. http://dx.doi.org/http://doi.acm.org/10.1145/1017074.1017077

[DS05] I. Drost and T. Scheffer. Thwarting the nigritude ultramarine: learning to identify link spam. Proceedings of the 16th European Conference on Machine Learning (ECML), 2005. http://www.informatik.hu-berlin.de/~scheffer/publications/ecml2005-nigritude.pdf

[AW03] K. Aberer and J. Wu. A framework for decentralized
ranking in web information retrieval. In APWeb,
pages 213–226, 2003.

[AW04] K. Aberer, J. Wu. Towards A Common Framework for Peer-to-Peer Web Retrieval. Book Chapter of From Integrated Publication and Informations Systems to Virtual Information and Knowledge Environments, EJN-Festschrift, Matthias Hemmje Ed., Springer LNCS, 2004.

[WA04] J. Wu and K. Aberer. Using SiteRank for P2P Web Retrieval. Tech Rep IC/2004/31, Swiss Federal Institute of Technology, Lausanne. 2004
[BP98] L. Page,S. Brin Sergey, R. Motwani and T. Winograd. The PageRank citation ranking:
bringing order to the Web. Stanford Digital Library Technologies Project, 1998. http://citeseer.ist.psu.edu/page98pagerank.html

[FM04] D. Fetterly, M. Manasse and M. Najork. Statistical analysis of spam through detection of outliers in histograms. Proceedings of the seventh workshop on the Web and databases (WebDB), 2004. http://dx.doi.org/http://doi.acm.org/10.1145/1017074.1017077

[DS05] I. Drost and T. Scheffer. Thwarting the nigritude ultramarine: learning to identify link spam. Proceedings of the 16th European Conference on Machine Learning (ECML), 2005. http://www.informatik.hu-berlin.de/~scheffer/publications/ecml2005-nigritude.pdf

[WA04b] J. Wu, K. Aberer. Using SiteRank for Decentralized Computation of Web Document Ranking. Proceedings of the Third International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems, Eindhoven University of Technology, The Netherlands. 2004

[B05] R. A. Baeza-Yates. Applications of web query mining. In
Advances in Information Retrieval, 27th European Confer-
ence on IR Research, ECIR 2005, Santiago de Compostela,
Spain, March 21-23, 2005, Proceedings, volume 3408 of Lec-
ture Notes in Computer Science, pages 7-22. Springer, 2005.

[NCO04] A. Ntoulas, J. Cho, and C. Olston. What's new on the web?: the evolution of the web from a search engine perspective. In WWW '04:
Proceedings of the 13th international conference on World Wide Web , pages
1--12, New York, NY, USA, 2004. ACM Press.

[KHS03] R. Kraft, E. Hastor, and R. Stata. Timelinks: Exploring the
link structure of the evolving web. In Second Workshop on Algorithms
and Models for the Web-Graph (WAW2003) , 2003.

[KNRT03] Ravi Kumar, Jasmine Novak, Prabhakar Raghavan, and Andrew
Tomkins. On the bursty evolution of blogspace. In WWW '03: Proceedings
of the 12th international conference on World Wide Web, pages 568--576, New York, NY, USA, 2003. ACM Press.

[BKM00] A. Z. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan,
S.Stata, A. Tomkins, and J. Wiener. Graph structure in the web. Computer Networks, 33:309--320, June 2000.

[New05] M. E. J. Newman. Power laws, pareto distributions and zipf's law.
Contemporary Physics, 46:323--351, December 2005.

[PaRa02] G. Pandurangan, P. Raghavan, and E. Upfal. Using Pagerank to characterize Web structure.
In Proc. of COCOON, vol. 2387 of Lecture Notes in Computer Science}, pages 330--390, 2002.

[1] H. Edelsbrunner. Algorithms in Combinatorial Geometry. Springer-Verlag, New York, NY, 1987.

[6] D. Sivakumar R. Fagin, R. Kumar. Efficient similarity search and classification via rank aggregation. In SIGMOD ’03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 301–312, New York, NY, USA, 2003. ACM Press.

[10] L. Ertz, M. Steinbach, and V. Kumar. Finding topics in collections of documents: A shared nearest neighbor approach. In Text Mine ’01, Workshop on Text Mining, First SIAM International Conference on Data Mining, Chicago, IL, (2001).

[14] Jon M. Kleinberg. Two Algorithms for Nearest-Neighbor Search in High Dimensions. In STOC, pages 599–608, 1997.

[18] A. Guttman. R-trees: A dynamic index structure for spatial searching. In SIGMOD Conference, pages 47–57, 1984.

[26] J. Goldstein and Raghu Ramakrishnan, Contrast Plots and P-Sphere Trees: Space vs. Time in Nearest Neighbour Searches. Proceedings of the 26th International Conference on Very Large Data Bases (2000), 429 - 440


[101] Junghoo Cho and Hector Garcia-Molina. Parallel crawlers. In Proc. of the Eleventh International World-Wide Web conference (WWW 2002), pages 124-135, Honolulu, Hawaii, 2002. ACM Press.

[102] Allan Heydon and Marc Najork. Mercator: A scalable, extensible web crawler. World Wide Web, pages 219-229, December 1999.

[103] Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. Ubicrawler: A scalable fully distributed web crawler. Software: Practice & Experience, 34(8):711- 726, 2004.

[104] Keith Randall, Raymie Stata, Rajiv Wickremesinghe, and Janet L. Wiener. The LINK database: Fast access to graphs of the Web. Research Report 175, Compaq Systems Research Center, Palo Alto, CA, 2001.

[105] Krishna Bharat, Andrei Broder, Monika Henzinger, Puneet Kumar, and Suresh Venkatasubramanian. The Connectivity Server: Fast access to
linkage information on the web. In Proceedings of the Seventh International World–Wide Web Conference, pages 469–477, Brisbane, Australia, 1998.

[106] Paolo Boldi and Sebastiano Vigna. The WebGraph framework I: Compression techniques. In Proc. of the Thirteenth International World Wide Web Conference, pages 595- 601, Manhattan, USA, 2004. ACM Press.

[109] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, September 1999.

[110] Paolo Boldi, Massimo Santini, and Sebastiano Vigna. Do your worst to make the best: Paradoxical effects in PageRank incremental computations. Internet Math., 2(3):387- 404, 2005.

[111] Paolo Boldi, Massimo Santini, and Sebastiano Vigna. PageRank as a function of the damping factor. In Proc. of the Fourteenth International World Wide Web Conference, Chiba, Japan, 2005. ACM Press.

[112] Sebastiano Vigna. TruRank: Taking PageRank to the limit. In Poster Proc. of the Fourteenth International World Wide Web Conference, pages 557 566, Chiba, Japan, 2005. ACM Press.
Keywords
ALGORITHMS AND DATA STRUCTURES FOR THE WEB, LINK ANALYSIS, CRAWLING, SPAM DETECTION, NEAREST NEIGHBOURS, RANK AGGREGATION, GRAFO DEL WEB, COMPRESSION

Web Ram: Web Retrieval and Mining

Università degli Studi di Roma "La Sapienza"
Abstract
This research program concerns a set of open problems in the field of contemporary research for web information retrieval (web engines). We want to investigate a set of problems that are relevant enough to be pursued independently but that, at the same time, interact at various levels in a coherent whole:

1. Crawling:

- we want to collect a series of snapshots of a sizeable portion of the web (eg. the .it domain) in order to (a) investigate the temporal evolution of the web and (b) develop techniques that service queries with documents that are not only relevant but also up-to-date

- we want to continue our study of crawling strategies that combine the good properties of known approaches. In particular we are looking for strategies that are very efficient in terms of memory utilization while at the same time "polite" and capable of retrieving high quality pages first


2. Spam: We want to develop new algorithms that are able to detect spam and rank web pages accordingly

3. Signal aggregation: Current web search technology uses a variety of clues and assembles them together to assess the relevance of a document to a query. We have a promising novel approach to combine two of the most important clues, PageRank and cosine similarity, that we intend to investigate. We will also investigate how to combine other important clues: query streams, click streams and text mining. We are particularly interested in scalable and efficient distributed implementation of such schemes.


4. New data structures for indexing schemes:

- Skip placement: The basic data structure to index web corpora is the inverted list. So-called skips are used to speed up boolean queries. There seems to be no systematic investigation of the best strategy for skip placement: this is what we intend to do.

- Nearest Neighbours: It is well-known that the basic task of finding documents that are relevant to a given query boils down to the following geometric problem. Given a set of points and a query point Q, find the K closest points to Q. We have new, very promising algorithms and data structures that we intend to study and test against the best state-of-the-art solutions.

5. Foundational issues:

- We intend to investigate a novel and very promising generative model to generate text corpora.

- Models for self-similar finite graphs: The self-similarity of the web graph is considered by many as an established fact but in reality there is not even a precise definition of self-similarity, let alone a proof. We would like to suggest some new definitions in this direction and apply them to the web graph to corroborate empirically the existing claims about its fractal nature.

- Axiomatisation of ranking systems: Ranking systems for the Web are similar in nature to voting systems and social choice systems. The latter have been studied using axiomatic approaches that lead to celebrated results such as Arrow's Impossibility Theorem. Attempts to apply similar principled approaches to PageRank have appeared recently. We would like to pursue this line of research, with applications to other ranking techniques. <<<

Principal Investigator
Alessandro Panconesi Università degli Studi di ROMA "La Sapienza"
Research Objectives
Information Retrieval for the Web (Web IR) is a thriving area of contemporary computer-science research of great industrial relevance. Its development clearly demonstrates that many of the crucial ideas that eventually lead to the most successful applications are firmly grounded in sound and mathematically sophisticated research in algorithms and data structures. For instance, it is well-known that the leading web engine by Google is based on PageRank, a mechanism to assign "quality" to web pages that boils down to the (efficient!) computation of the stationary distribution of a Markov chain. Likewise, the commercial success of Teoma, one of the world leaders of web engine technology, is based on the efficient implementation of an ingenious ranking algorithm due to Jon Kleinberg and known as Hits. Examples abound.

Research in Web IR naturally proceeds along four different lines that, to be most effective, must continually interact as in a virtuous circle:

1. Data collection: To collect snapshots of the Web

2. Data analysis: To perform data mining and statistical analysis of web data in order to detect oddities and regularities of the web structure.

3. Algorithms and data structures: To develop cutting edge applications

4. Mathematical modelling and its validation: To propose models that reproduce crucial features of web data.

Web IR is to a large extent an experimental science that relies heavily on data. Due to the huge proportions of the Web, data collection is in itself a research endeavour. Data collection continues to be of paramount importance because (a) there are very few data repositories in the public domain and (b) the web changes with time and must be monitored accordingly. Members of our consortium are recognised world-leaders in this field and therefore our project gives the best assurance that this fundamental data collecting activity will be carried out effectively.

It is perhaps worth mentioning that this aspect of our research agenda has very important political and sociological implications. Web-search technology is currently dominated by very few giant overseas corporations that can bring to bear their enourmous resources. It would be a very unhealthy development, with far-reaching social implications, if the know-how for this important technology were to become their exclusive monopoly. Data collection is VITAL to counter this dangerous trend and to ensure that the know-how in the public, university domain stays up-to-date.

While collecting data is the bedrock of web research, it is by mining and analysing the data that useful algorithmic ideas and tools can come about. Regularities such as the bow-tie structure of the web, or oddities such as spam structures lead to better understanding and, as a consequence, to better application tools.

Following the categorization above, these are the activities that we intend to pursue:

1. Data collection:

1.1 We want to collect snapshots of the Web and make them publicly available. In this fashion we will provide a FUNDAMENTAL infrastructure for all kinds of web research at the national and international level.

1.2 We will focus in particular on the dynamic aspects of web evolution: by taking several snapshots at regular intervals of sizeable domains (such as .it) we will be able to provide, for the first time, data sets that are the indispensable prerequisite to investigate the dynamic evolution of the Web. Thanks to this activity it will also possible to develop techniques that service queries with documents that are not only relevant but also up-to-date.

1.3 We want to investigate new crawling techniques that are resource (especially memory) aware.

2. Data mining and statistical analysis:

2.1 As a follow up to 1.2, we will investigate the dynamic evolution of the Web

2.2 Spam detection: This is one of the most important problems in current Web IR. We have a promising approach that we hope will lead to algorithms capable to detect spam pages and to reduce their ranking with respect to non-spam documents.

3. Algorithms and data structures:

3.1 Aggregating different signals: Current web search technology uses a variety of clues and assembles them together to assess the relevance of a document to a query. We have a promising novel approach to combine two of the most important clues, PageRank and cosine similarity, that we intend to investigate. We will also investigate the aggregation of link analysis with other valuable sources of information such query streams, click streams and text mining. We are particularly interested in scalable and efficient distributed implementation of such schemes.

3.2 Skip placement: The basic data structure to index web corpora is the inverted list. So-called skips are used to speed up boolean queries. There seems to be no systematic investigation of the best strategy for skip placement: this is what we intend to do.

3.3 Nearest Neighbours: It is well-known that the basic task of finding documents that are relevant to a given query boils down to the following geometric problem. Given a set of points and a query point Q, find the K closest points to Q. We have new, very promising algorithms that we intend to study and tests against the best state-of-the-art solutions.

3.4 Distributed (Page) Rank computation: Consider a distributed system, organized as a P2P network in which each node as partial knowledge of a larger portion of the web graph. We would like to develop new scalable, communication-efficient algorithms that can rank the entire web graph in a distributed fashion.

3.5 Spam-resilient ranking: we intend to put to test a novel approach to ranking algorithms that avoid spam.


4. Mathematical modelling and its validation:

4.1 We intend to investigate a novel and very promising generative model to generate text corpora. This goes hand in hand with 3.3

4.2 Models for self-similar finite graphs: The self-similarity of the web graph is considered by many as an established fact but in reality there is not even a precise definition of self-similarity, let alone a proof. The main obstacle here is that the usual, mathematically precise, notions of self-similarity apply to infinite objects (Hausdorff vs. topological dimension). We would like to suggest some new definitions in this direction and apply them to the web graph to corroborate empirically the existing claims about its fractal nature.

4.3 Axiomatisation of ranking systems: Ranking systems for the Web are similar in nature to voting systems and social choice systems. The latter have been studied using axiomatic approaches that lead to celebrated results such as Arrow's Impossibility Theorem. Attempts to apply similar principled approaches to PageRank have appeared recently. We would like to pursue this line of research, with applications to other ranking techniques.

Although each activity deserves to be pursued in its own right, they form a coeherent whole since, in the end, they just tackle different aspects of an integrated whole: a high-performance search engine (NOTE: we are not promising to build such a thing but only emphasising existing relationships among parts of our research). It is by pursuing them in parallel that we might reap the highest rewards. For instance, it is by analizing the power-law structure of the search trees implicitely generated by the crawling process that members of our team could develop a compression technique for web graphs that, by using only 3 bits per edge on average, is the veritable world-record holder. In turn, compression techniques allow for more efficient crawls thus creating a virtuous circle. Our consortium, by virtue of the close professional ties among its members, ensures that this kind of virtuos interaction between different kinds of tasks and expertise will take place, so that the entire research effort will be pursued as a coherent whole, as it is advisable. <<<
Timescale
24 months
National and international background
In what follows we briefly review the state of the art concerning the various problems that we intend to investigate within the research unit at La Sapienza.


NEAREST NEIGHBOURS:

Finding the most relevant documents to a given query in many instances boils down to the following geometric problem: given a set S in some d-dimensional spcae, and a query point Q, find the closest K points to Q. In typical applications the dimension of the space is very large or even huge. This fact, known as the "curse of dimensionality", effectively rules out the sophisticated solutions developed in the context of computational geometry, since these algorithms typically depend exponentially in the number of dimensions (see, for instance, [1]).

Concerning retreival applications, one interesting solution are the so-called p-spheres [26]. This method requires a sample of the query distribution to be known in advance and uses this knowledge to partition the space. The idea is that, with approprate data structures, the query point Q can zero in on the portion of the space that is most relevant.

In [6] the authors propose approximation algorithms that are based on the notion of rank aggregation (a notion that goes back to the French Revolution when Condorcet was studying voting schemes). The corpus and the query are projected on a random vector. The document which is closest to the query on this vector gets one “vote”. Theoretical results of [14] suggests that by using a small set of random vectors the document closest to the query will obtain lots of votes with high probability. Experimental work of [6] show that this approach works very well for some practical applications. However these experiments are performed for a corpus whose points are dense vectors and have relatively low-dimension (a few dozens). It remains to be seen how this scheme performs for text retrieval applications. The same holds for [26].

In [7] the authos use hashing to find approximate nearest neighbours. They give a theoretical analysis showing that the method locates the nearest neighbour with high probability, and give experimental results for low-dimension spaces. The papers [10, 11, 12] study the problme for cosine similarity, as opposed to euclidean distance.

Some other approaches are based on suitable data structures such as the R-tree [18] and its many variants (see e.g. the survey [24]). A variant specifically designed for nearest neighour queries, the SR-tree is proposed in [23]. A comparative study between SR-trees and P-spheres suggests that the latter might have better performance [26].



SAMPLING AND CRAWLING OF THE WEB

References: [Ba96, BY05, BYSJ02, HH00]

Understanding the topological structure of the World Wide Web is a useful but daunting task (current estimates are of 11,5 billion of pages). This motiveates the use of sampling to infer its topological properties.

There are basically two types of sampling techniques that are used for the Web: sampling by random walks, that implies starting at a given page and follow out-links at random, and vertical sampling, that involves delimiting in advance a set of Web pages to be crawled.

A number of studies analyzes different strategies such as larger-sites first, breadth-first, OPIC, partial-pagerank. The main aim of these studies is to find out which strategy is able to download important pages first.

Crawling is clearly a fundamental problem and still very much an active area of research. Again, tha main issue is size and how to direct the search in ways that might collect “useful” portions of the web, while keeping the resource utilization (such as memory, time and bandwidth) within reasonable limits.

The crawling problem, that is, the problem of collecting data on the web starting from a known seed, is well known and studied since the creation of the first search engines. In one of the first scientific papers on the subject [101] Heydon e Najork described the Altavista crawler, setting many of the feature that have by now become common, such as compact URL representation by fingerprinting.

The quick growth in size of the web made it necessary to parallelize significantly the crawling process [102]. This is the purpose behind the creation of UbiCrawler, a web crawler composed by autonomous self-coordinating agents written in Java [103]. An arbitrarily large number of UbiCrawler agents can be started, increasing linearly the amount of information absorbed from the net.

THE WEB GRAPH:

Once a web snapshot has been collected, one of the most interesting aspects is the analsys of the web graph, that is, the graph having pages as nodes and an arc from x to y if x contains a hypertext link to y. The analysis and manipulation of graphs of such a large size (billions of links) requires new representation techniques. The first, seminal work on the subject gave rise to the LINK database and to the Connectivity Server [104,105]. Other proposals have appeared in the following years, refining the compression ratio. Presently, the best available compression is provided by WebGraph [106] (http://webgraph.dsi.unimi.it/). WebGraph provides free software and large-size dataset that can be used with minimal infrastructure; several datasets have been gathered by UbiCrawler, and are currently used by several research groups worldwide.

One of the reasons why the web graph is so important is that search engines use its structure to rank statically pages (i.e., in a way that is independent of the user query). The most famous example is certainly PageRank, the algorithm used by Google, but other algorithms have been proposed in the literature, for instance HITS [14], used by AskJeeves. The study of PageRank mixes aspects of linear algebra and graph theory. For instance, using as a correlation measure Kendall's tau, we have shown that breadth-first visits, which gather quickly pages with high PageRank, create partial snapshot that misrepresent the relative rankings: that is, one gathers quickly important pages, but their relative order (computed during the crawl) is badly approximated [110]. Always in this direction, we have been able to provide closed formulae for all derivatives of PageRank as a function of the damping factor [111]. This study has led to the definition of interesting variations of PageRank that resist damping decrease better, for instance, TruRank [112]. All the activities above have led to the implementation of sophisticated statistical and linear-analysis programs that are distributed as free software (http://law.dsi.unimi.it/).



SPAM DETECTION

References: [BP98, GM05, FM04, DS05]

Link spam is nowadays widespread in the Web and its presence affects the quality of the information retrieved in response to user queries. Detecting link spam is thus becoming increasingly important.

Previous work has shown that the presence of link spamming is often related to anomalies in the distribution of certain properties of the Web graph, such as in- and out-degree, or to the presence of dense sub-graphs. Also, other contributions show that spam pages are expected to be sensitive to changes in the damping factor of Google's PageRank algorithm. Still, the interplay of all these factors is not clear. One approach consists in detecting a set of potential spam pages having high rank and then explore the pages that are close to them in the graph to search for anomalies and verify that the high rank was not obtained by deceptive means. A complementary approach consists in devising an initial set of trusted pages and propagating trust across their links, so as to devise communities of potentially spam-free pages. Heavily linked with little or no relationship to a community of trusted pages are suspicious. Both method require manual-like detection of an initial core, while the first method can be applied to only a few pages at the time. An approach that is orthogonal to the previous ones is content-based analysis, which considers the distribution of relevant features such as page size or distribution of keywords within a manually tagged set of pages.


INTEGRATING LINK WITH CONTENT, WEB LOGS AND TEMPORAL ANALYSIS:

Link analysis provides a very powerful tool for ranking and
classifying Web documents. Still, in oredr to get high quality results it must be combined with other valuable sources of information like time information, content (text, multimedia), and Web usage (transactions from Web logs).

The role of time is becoming more and more crucial in order to develop
search-engine algorithms able to provide the final user with the most
up-to-date results possible. In the past decade, the Web has experienced
a very rapid growing rate. Because of this huge amount of data, the search
engines have to constantly afford the burdensome task of updating their
index in order to keep in touch with the evolving Web. Most of the existing literature concentrated on large samples of the Web collected over few months of crawling activity [BKM00, DLSa, DLSb]. Very little work has been devoted to analyze the temporal evolution of the Web. To capture the relation between the popularity, authority and time, some
recent studies [KHS03, KNRT03] have presented structures that directly couple hyperlinks with temporal data.

Server logs of search engines store traces of queries submitted by users, which include queries themselves along with Web pages selected in their answers. Query mining is based in the fact that user queries in search engines and Websites give valuable information on the interests of people. In addition, clicks after queries relate those interests to actual content [B05]. <<<