Contenuto
Ti trovi in: HOME »Programmi, progetti e risultati »I progetti »PRIN - Programmi di ricerca di Rilevante Interesse Nazionale»Programma di ricerca»Unità di ricercaINIZIO_TESTO_DA_INDICIZZARE
UNITA' DI RICERCA
italiano - english
Research program
Web Ram: Web Retrieval and MiningUniversity Co-ordinator
Università degli Studi di ROMA "La Sapienza" - INFORMATICA - ()Research Unit Leader
Alessandro PanconesiDescription
In the following we describe the main research threads that will be pursued by the research unit of La Sapienza University of Rome. We focus on a set of interrelated problems that appear interesting enough to be pursued independently. They are however inter-related and our research unit will explore these interconnections.Nearest neighbours:
We are given a set S of n points in d-dimensional space. Given a new query point q, we seek to find the K points in S nearest to q, under some specified metric. This K-nearest neighbor problem arises persistently in information retrieval, multimedial retrieval, pattern matching, supervised learning/classification and data mining. Efficient solutions to this problem are both critical and few.
In these applications the dimensions are "features" arising from documents, images, media, fingerprints or other objects. In text retrieval applications, each object in the corpus (e.g. a document) is turned into a point in the space of features (e.g terms), as is a retrieval query. Retrieval is then equivalent to finding the nearest neighbors to the query point. Because the number of features is almost always large (ranging from several tens for simple image retrieval to hundreds of millions for text retrieval on a web scale), the dimensionality of the space is too large to invoke methods from computational geometry such as Voronoi partitions. Consequently, it it difficult in the worst case to do much better than the following naive algorithm: measure the distance from the query point to each object point, then select the K shortest distances. Note that in this naive scheme, the exhaustive computation of distances from the query to every data point is done at query time, rendering it a crucial bottleneck in query processing.
To avoid such exhaustive distance computation and speed up query processing, the key is to quickly eliminate from consideration many of the data objects; such "pruning" is the goal of any indexing scheme. We have a new, extremely simple pruning scheme that deserves to be studied. Very encouraging preliminary results obtained in collaboration with Prabhakar Raghavan, Head of Yahoo! Research, and prof. Eli Upfal, of Brown University, indicate that for text retrieval applications it might very well outperform state-of-the-art solutions. We plan to perform a systematic study of our scheme by comparing it to well-known state-of-the-art solutions such as p-spheres, SR-trees and rank aggregation. The available studies concerning the latter schemes concern low-dimensional data sets with dense vectors, whereas in text retrieval applications the dimension is huge and the vectors extremely sparse. Thus, as a side effect, such a study will shed light on the performance of these well-known methods for text retrieval applications. We also plan to study the problem analytically. We have a very plausible generative model for text data under which our scheme is rigorously (i.e. mathematically) proven to perform well. We plan to investigate in a rigorous and systematic fashion the plausibility of our model to see how well real data fit it.
Spam detection:
Our main research goal here is to provide algorithmic tools for automatically detecting web-link spamming, and for demoting pages that are suspect of being spam. A first step is to provide an effective characterization of link spamming in terms of the underlying web graph. The key role here is played by so-called link farms. A link farm is a densely connected set of pages, created explicitly with the purpose of deceiving a link-based ranking algorithm. Link-analysis algorithms assume that every link represents an endorsement, in the sense that a link to a page is an indication of the "quality" of the pointed page. The supporters of a page are all pages that contribute towards its link-based ranking, either directly, or along a directed paths in the Web graph that leads to the supported page. We intend to exploit some typical features of link farms against spammers. One of these is that most of the contribution to the rank of a spam page in a link farm comes from pages that are close to it, normally only a few hops apart, thus ignoring endorsments due to far-away pages. A page that participates in a link farm may have a high in-degree, but little relationship with the rest of the graph. Such a page will often have a large number of distinct supporters at short distances, but this number should be lower than expected at higher distances. The main problem here is that, given the huge data size typical of web data efficiency and memory constraints are major issues. Moreover, efficiency suggests performing the analysis in parallel for all nodes at the same time to amortize costs. For this reason, we intend to propose algorithms that perform the required computation in parallel, using the same recursive structure of Google's PageRank algorithm. This would have the advantage of providing a method for computing the rank of pages and detecting spam communities at the same time.
The distribution of Pagerank:
The main goal here is to understand the behaviour of the PageRank distribution and its relationship to the topological properties of the underlying Web graph. To this end, a first step is to understand the quality of the power law approximation for the indegree distribution in Web graphs. In some cases, this distribution does not seem to follow a log-normal law. Our first step will be to perform an extensive experimentation with the aim of developing more refined models of the indegree distribution, more closely fitting observed data. Our second goal is to relate the distribution of PageRank to the topological characteristics of the underlying graph. This will be achieved with respect to random graph models that describe key characteristics of Web graphs, such as the indegree distribution.
Crawling strategies:
Web crawlers are used by Web search engines to visit Web pages automatically, by recursively following links. A queue of pages to be visited is initialized with a given set of starting pages, and the newly discovered pages are added to this queue. The queue may become very large mostly due to the existence of a large number of dynamic pages, that is, pages that are generated automatically by querying a data source.
As the Web has potentially an infinite number of pages and most of them will never be visited by the crawler, we aim at keeping the queue of waiting pages relatively small. Typical prioritization policies are (in essence) breadth-first search (BFS) and depth-first search (DFS).
Using breadth-first search the queue size grows very fast, up to a maximum of roughly 15% of the total number of pages (which can be in the order of billions of pages in the case of the full Web), whereas in depth-first search the queue tends to be much smaller. However, depth-first search cannot be used in practice for several reasons: it requires to focus the crawler on a few sites, which might be against typical "politeness" policies; it fails to capture pages with high quality and it might be too prone to get caught into artificially crafted Web-page loops, also known as "crawler traps".
The goal of our research if to find a strategy that has the good properties of breadth-first search in terms of quality of the retrieved pages and politeness with Web servers, but that uses a queue size comparable to that of depth-first search. Reducing memory usage in Web crawling has a number of benefits. First, in practice most of the pages in the queue will never be visited; second, in the case of parallel crawlers that must exchange URLs, the amount of URLs exchanged can be reduced; third, in the case of focused crawlings by personal Web crawlers (that run on normal desktop PCs instead of large servers), keeping
the usage of resources of the crawler small is important.
A second set of activities concerns a comparative evaluation of different crawling strategies, using a crawler simulator purposedly developed for this scope. The main focus will be the minimization of the crawling time. The simulator will take into account several parameters such as:
(a) The time of the simulation
(b) The number of active agents (threads) that simultaneously download the pages
(c) The nominal bandwidth of each site
(d) The effective bandwidth being used
(e) The total available bandwidth
(f) The interval in seconds between requests to a single Web site.
(d) The crawling policy (i.e. largest-site-first, fastest-site-first, smallest-site-first and so on)
We intend to accomplishing an extensive experimental activity in order to test not only a large number of crawling strategies (or a combination of some of them) but also to tune up the parameters (for example, the number of active agents versus the available bandwidth) that characterize the real crawler activity. We remark that our unit is in possession of a functioning prototype of this kind of simulator.
Sampling:
The success of link analysis and of other topological notions such as the bow-tie structure of the Web illustrate the importance of understanding the topology of the web graph, the graph defined by web pages and the links between them. Due to its enormous size, estimated to be in the order of many billions of pages, it becomes important to try to infer as much topological information as possible about the web graph by means of sampling small portions of the web. Several sampling methodologies have been proposed in the literature and we plan to perfoprm a systematic comparative evaluation of them. In particular, for every approach we aim at determining the size of the sample that is required in order to make the predictions reliable.
Aggregating signals:
Modern search engines use a variety of cues (signals) to rank documents with respect to their relevance to a given query. Two of the most informative signals are PageRank and cosine similarity. We have a novel approach to combine the two that is computationally very efficient while retaining quality. We also have a very rigorous framework within which we can test the effectiveness of the approach in a systematic way. This is very important since experimental work of this kind is usual heuristic in nature and lacks rigour.
Important new cues can come from mining temporal information. Although the Web evolves through time, very little work has been devoted so-far to analyze the temporal evolution of the Web and, even less, to try to exploit temporal information. To capture the relation between the popularity, authority and time, some recent studies [ACHLS04, KHS03, KNRT03] have presented structures that directly couple hyperlinks with temporal data. We intend to pursue this promising line of research.
Finally, cues from server logs of search engines can be a very useful source of information, to be exploited using collaborative-filtering approaches. In particular query mining might give valuable information on the interests of people. Likewise, mining clickstreams, i.e. clicks made after queries, can give very precise indications relating interests to actual content [B05]. Clustering and collaborative-filtering techniques can improve the relevance and accuracy of the given answers of a search engines. We intend to pursue this line of reaserch further.
The challenge is to define new concepts and methods that allow to reconstruct a picture of the Web at a specific time from data collected over time and available in Web repositories, design methods that allow to associate vertices and hyperlinks with temporal information, determine the evolution over time of the main statistical properties of the Webgraph, as for instance degree-distribution, connectivity properties,
spectral properties, quality of pages, the topical regions of the Web. This activity has close relation to the design of better ranking and clustering algorithms that take into account the temporal information of the Web graph. We plan for instance to study the time evolution of the connected components of the Web and identify typical temporal patterns of transition of Web documents within this structure. This analysis should
in principle allow to characterize freshness and authoritativeness of documents.
An important challenge is to incorporate link analysis with content analysis in a manner that preserves scalability and fast retrieval of documents. We would like to investigate the introduction of semantic links between Web document with the objective to enrich link analysis with information obtained by applying Web mining techniques to text and web
logs information. The linked structure can be for instance integrated with links between documents obtained by computing distance between their projections into a vectorial space or through the analysis of query and click streams from expert users.
We will test methods for incorporating time and content analysis into link analysis ranking algorithms by collecting periodic crawls of the Web within the project. An alternative valuable source of data that will also extensively be used in the project is provided by the free on-line distributed encyclopedia Wikipedia available in several different
languages. This is nowdays considered one of the most authoritative sources of information between those available inside and outside the Web. The Wikipedia database allows to reconstruct a snapshot of the data set at any given time by reporting together with the documents all updates provided by the content creators. Wikipedia also allows to draw similarities between pages based on content creators and makes available a
taxonomy of documents based on a set of predefined categories.
P2P implementation of ranking algorithms:
The peer-to-peer methodology (P2P) utilizes the computing power of thousands of computers together to carry out one single task, often resulting in great speed-ups. Web ranking algorithms that, as it is well-known, form the core of many existing high-quality search engines, can benefit from a P2P appoach. In particular. It seems that a decentralized approach makes search results less susceptible to the influence of spam factories. In fact it is possible to assign a value of "trustworthiness" to peers with respect to their ranking of a single document.
A first research goal of our unit is to refine and implement some of the link analysis algorithms recently presented in the literature for the distributed computation of PageRank and similar ranking schemes, via a P2P appoach. In particular we want to extend these algorithms to the dynamic case, when peers enter and leave the system. This kind of research is also related to rank aggregation (a topic that, a it was succintly discussed, is quite relevant to nearest-neighbour computation and the aggregation of signals in general). Rank aggregation techniques are quite natural in this scenario since the final rank of a page will be obtained by merging the recommendations of several peers. It is particularly interesting to develop new techniques that are effective in real (i.e. dynamic) scenarios.



