Vai al contenuto| Home page|

   Ti trovi in: HOME »Programmi, progetti e risultati »I progetti »PRIN - Programmi di ricerca di Rilevante Interesse Nazionale»Programma di ricerca
INIZIO_TESTO_DA_INDICIZZARE

RESEARCH PROGRAM

italiano - inglese
Similar research programs:
Scientific and education field classification
International Patent Classification
Geographical classification
Bibliografia
[Ba96]Introduction to Discrete Event Simulation. In proceeding of the
2th DYCOMANS workshop on " Management and Control : Tools in
Action", pp. 367-376 Algarve, Portugal.1996.

[BYSJ02] R. Baeza-Yates, F. Saint-Jean and C. Castillo. Web
structure, dynamics and page quality. In Proceedings of String
Processing and Information Retrieval (SPIRE), volume 2476 of Lecture
Notes in Computer Science, Lisbon, Portugal. Springer. 2002.

[CMRZ04] C. Castillo, M. Marin, A. Rodr?guez, and R. Baeza-Yates.
Scheduling algorithms for Web crawling. In Latin American Web
Conference (WebMedia/LA-WEB), pp 10–17, Riberao Preto, Brazil,
October 2004. IEEE CS Press.

[HH00] M. Henzinger, A. Heydon, M. Mitzenmacher and M. Najork. On
near–uniform url sampling. In Proceedings of the Ninth Conference on
World Wide Web, pages 295–308, Amsterdam, Netherlands. Elsevier
Science. 2000.

[BP98] L. Page,S. Brin Sergey, R. Motwani and T. Winograd. The PageRank citation ranking:
bringing order to the Web. Stanford Digital Library Technologies Project, 1998. http://citeseer.ist.psu.edu/page98pagerank.html

[GM05] Z. Gyongyi and H. Garcia-Molina. Web Spam Taxonomy. First International Workshop on Adversarial Information Retrieval on the Web, 2005. http://dbpubs.stanford.edu:8090/pub/2004-25

[FM04] D. Fetterly, M. Manasse and M. Najork. Statistical analysis of spam through detection of outliers in histograms. Proceedings of the seventh workshop on the Web and databases (WebDB), 2004. http://dx.doi.org/http://doi.acm.org/10.1145/1017074.1017077

[DS05] I. Drost and T. Scheffer. Thwarting the nigritude ultramarine: learning to identify link spam. Proceedings of the 16th European Conference on Machine Learning (ECML), 2005. http://www.informatik.hu-berlin.de/~scheffer/publications/ecml2005-nigritude.pdf

[AW03] K. Aberer and J. Wu. A framework for decentralized
ranking in web information retrieval. In APWeb,
pages 213–226, 2003.

[AW04] K. Aberer, J. Wu. Towards A Common Framework for Peer-to-Peer Web Retrieval. Book Chapter of From Integrated Publication and Informations Systems to Virtual Information and Knowledge Environments, EJN-Festschrift, Matthias Hemmje Ed., Springer LNCS, 2004.

[WA04] J. Wu and K. Aberer. Using SiteRank for P2P Web Retrieval. Tech Rep IC/2004/31, Swiss Federal Institute of Technology, Lausanne. 2004
[BP98] L. Page,S. Brin Sergey, R. Motwani and T. Winograd. The PageRank citation ranking:
bringing order to the Web. Stanford Digital Library Technologies Project, 1998. http://citeseer.ist.psu.edu/page98pagerank.html

[FM04] D. Fetterly, M. Manasse and M. Najork. Statistical analysis of spam through detection of outliers in histograms. Proceedings of the seventh workshop on the Web and databases (WebDB), 2004. http://dx.doi.org/http://doi.acm.org/10.1145/1017074.1017077

[DS05] I. Drost and T. Scheffer. Thwarting the nigritude ultramarine: learning to identify link spam. Proceedings of the 16th European Conference on Machine Learning (ECML), 2005. http://www.informatik.hu-berlin.de/~scheffer/publications/ecml2005-nigritude.pdf

[WA04b] J. Wu, K. Aberer. Using SiteRank for Decentralized Computation of Web Document Ranking. Proceedings of the Third International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems, Eindhoven University of Technology, The Netherlands. 2004

[B05] R. A. Baeza-Yates. Applications of web query mining. In
Advances in Information Retrieval, 27th European Confer-
ence on IR Research, ECIR 2005, Santiago de Compostela,
Spain, March 21-23, 2005, Proceedings, volume 3408 of Lec-
ture Notes in Computer Science, pages 7-22. Springer, 2005.

[NCO04] A. Ntoulas, J. Cho, and C. Olston. What's new on the web?: the evolution of the web from a search engine perspective. In WWW '04:
Proceedings of the 13th international conference on World Wide Web , pages
1--12, New York, NY, USA, 2004. ACM Press.

[KHS03] R. Kraft, E. Hastor, and R. Stata. Timelinks: Exploring the
link structure of the evolving web. In Second Workshop on Algorithms
and Models for the Web-Graph (WAW2003) , 2003.

[KNRT03] Ravi Kumar, Jasmine Novak, Prabhakar Raghavan, and Andrew
Tomkins. On the bursty evolution of blogspace. In WWW '03: Proceedings
of the 12th international conference on World Wide Web, pages 568--576, New York, NY, USA, 2003. ACM Press.

[BKM00] A. Z. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan,
S.Stata, A. Tomkins, and J. Wiener. Graph structure in the web. Computer Networks, 33:309--320, June 2000.

[New05] M. E. J. Newman. Power laws, pareto distributions and zipf's law.
Contemporary Physics, 46:323--351, December 2005.

[PaRa02] G. Pandurangan, P. Raghavan, and E. Upfal. Using Pagerank to characterize Web structure.
In Proc. of COCOON, vol. 2387 of Lecture Notes in Computer Science}, pages 330--390, 2002.

[1] H. Edelsbrunner. Algorithms in Combinatorial Geometry. Springer-Verlag, New York, NY, 1987.

[6] D. Sivakumar R. Fagin, R. Kumar. Efficient similarity search and classification via rank aggregation. In SIGMOD ’03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 301–312, New York, NY, USA, 2003. ACM Press.

[10] L. Ertz, M. Steinbach, and V. Kumar. Finding topics in collections of documents: A shared nearest neighbor approach. In Text Mine ’01, Workshop on Text Mining, First SIAM International Conference on Data Mining, Chicago, IL, (2001).

[14] Jon M. Kleinberg. Two Algorithms for Nearest-Neighbor Search in High Dimensions. In STOC, pages 599–608, 1997.

[18] A. Guttman. R-trees: A dynamic index structure for spatial searching. In SIGMOD Conference, pages 47–57, 1984.

[26] J. Goldstein and Raghu Ramakrishnan, Contrast Plots and P-Sphere Trees: Space vs. Time in Nearest Neighbour Searches. Proceedings of the 26th International Conference on Very Large Data Bases (2000), 429 - 440


[101] Junghoo Cho and Hector Garcia-Molina. Parallel crawlers. In Proc. of the Eleventh International World-Wide Web conference (WWW 2002), pages 124-135, Honolulu, Hawaii, 2002. ACM Press.

[102] Allan Heydon and Marc Najork. Mercator: A scalable, extensible web crawler. World Wide Web, pages 219-229, December 1999.

[103] Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. Ubicrawler: A scalable fully distributed web crawler. Software: Practice & Experience, 34(8):711- 726, 2004.

[104] Keith Randall, Raymie Stata, Rajiv Wickremesinghe, and Janet L. Wiener. The LINK database: Fast access to graphs of the Web. Research Report 175, Compaq Systems Research Center, Palo Alto, CA, 2001.

[105] Krishna Bharat, Andrei Broder, Monika Henzinger, Puneet Kumar, and Suresh Venkatasubramanian. The Connectivity Server: Fast access to
linkage information on the web. In Proceedings of the Seventh International World–Wide Web Conference, pages 469–477, Brisbane, Australia, 1998.

[106] Paolo Boldi and Sebastiano Vigna. The WebGraph framework I: Compression techniques. In Proc. of the Thirteenth International World Wide Web Conference, pages 595- 601, Manhattan, USA, 2004. ACM Press.

[109] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, September 1999.

[110] Paolo Boldi, Massimo Santini, and Sebastiano Vigna. Do your worst to make the best: Paradoxical effects in PageRank incremental computations. Internet Math., 2(3):387- 404, 2005.

[111] Paolo Boldi, Massimo Santini, and Sebastiano Vigna. PageRank as a function of the damping factor. In Proc. of the Fourteenth International World Wide Web Conference, Chiba, Japan, 2005. ACM Press.

[112] Sebastiano Vigna. TruRank: Taking PageRank to the limit. In Poster Proc. of the Fourteenth International World Wide Web Conference, pages 557 566, Chiba, Japan, 2005. ACM Press.
Keywords
ALGORITHMS AND DATA STRUCTURES FOR THE WEB, LINK ANALYSIS, CRAWLING, SPAM DETECTION, NEAREST NEIGHBOURS, RANK AGGREGATION, GRAFO DEL WEB, COMPRESSION

Web Ram: Web Retrieval and Mining

Università degli Studi di Roma "La Sapienza"
Abstract
This research program concerns a set of open problems in the field of contemporary research for web information retrieval (web engines). We want to investigate a set of problems that are relevant enough to be pursued independently but that, at the same time, interact at various levels in a coherent whole:

1. Crawling:

- we want to collect a series of snapshots of a sizeable portion of the web (eg. the .it domain) in order to (a) investigate the temporal evolution of the web and (b) develop techniques that service queries with documents that are not only relevant but also up-to-date

- we want to continue our study of crawling strategies that combine the good properties of known approaches. In particular we are looking for strategies that are very efficient in terms of memory utilization while at the same time "polite" and capable of retrieving high quality pages first


2. Spam: We want to develop new algorithms that are able to detect spam and rank web pages accordingly

3. Signal aggregation: Current web search technology uses a variety of clues and assembles them together to assess the relevance of a document to a query. We have a promising novel approach to combine two of the most important clues, PageRank and cosine similarity, that we intend to investigate. We will also investigate how to combine other important clues: query streams, click streams and text mining. We are >>>

Principal Investigator
Alessandro Panconesi Università degli Studi di ROMA "La Sapienza"
Research Objectives
Information Retrieval for the Web (Web IR) is a thriving area of contemporary computer-science research of great industrial relevance. Its development clearly demonstrates that many of the crucial ideas that eventually lead to the most successful applications are firmly grounded in sound and mathematically sophisticated research in algorithms and data structures. For instance, it is well-known that the leading web engine by Google is based on PageRank, a mechanism to assign "quality" to web pages that boils down to the (efficient!) computation of the stationary distribution of a Markov chain. Likewise, the commercial success of Teoma, one of the world leaders of web engine technology, is based on the efficient implementation of an ingenious ranking algorithm due to Jon Kleinberg and known as Hits. Examples abound.

Research in Web IR naturally proceeds along four different lines that, to be most effective, must continually interact as in a virtuous circle:

1. Data collection: To collect snapshots of the Web

2. Data analysis: To perform data mining and statistical analysis of web data in order to detect oddities and regularities of the web structure.

3. Algorithms and data structures: To develop cutting edge applications

4. Mathematical modelling and its validation: To propose models that reproduce crucial features of web data.

Web IR is to a large extent an experimental science >>>

Timescale
24 months
National and international background
In what follows we briefly review the state of the art concerning the various problems that we intend to investigate within the research unit at La Sapienza.


NEAREST NEIGHBOURS:

Finding the most relevant documents to a given query in many instances boils down to the following geometric problem: given a set S in some d-dimensional spcae, and a query point Q, find the closest K points to Q. In typical applications the dimension of the space is very large or even huge. This fact, known as the "curse of dimensionality", effectively rules out the sophisticated solutions developed in the context of computational geometry, since these algorithms typically depend exponentially in the number of dimensions (see, for instance, [1]).

Concerning retreival applications, one interesting solution are the so-called p-spheres [26]. This method requires a sample of the query distribution to be known in advance and uses this knowledge to partition the space. The idea is that, with approprate data structures, the query point Q can zero in on the portion of the space that is most relevant.

In [6] the authors propose approximation algorithms that are based on the notion of rank aggregation (a notion that goes back to the French Revolution when Condorcet was studying voting schemes). The corpus and the query are projected on a random vector. The document which is closest to the query on this vector gets one “vote” >>>