Vai al contenuto| Home page|

   Ti trovi in: HOME »Programmi, progetti e risultati »I progetti »PRIN - Programmi di ricerca di Rilevante Interesse Nazionale»Programma di ricerca»Unità di ricerca
INIZIO_TESTO_DA_INDICIZZARE

UNITA' DI RICERCA

italiano - english

Research program

New technologies and tools for the integration of Web search services
University Co-ordinator
Politecnico di MILANO - ELETTRONICA E INFORMAZIONE - ()
Research Unit Leader
Stefano Ceri
Description
The proposed project is devoted to realize the infrastructures for NGS services (Next Generation Search engines) and is organized in five tasks:

T1: Infrastructure design, focused on the design of an infrastructure for registering Web Services and Web Wrappers. Resources are registered together with their local schema and their input/output message tags are mapped on top of the concepts of a Global Ontology.

T2: Search-time support, supporting the execution of users’ queries, which provides facilities for query submission and refinement and support of the join execution strategy and result materialization.

T3: Wrapper development, enabling the searching and extraction of information from Web data with a Web Service interface that imitates the interface of “conventional” search engines.

T4: Query reformulation, determining the set of services relevant to a user query and the conditions for their pairwise join. Reformulation considers the constraints expressed by the Global Ontology, the local schemas of the services, and the mappings between the Global and the local schemas.

T5: Search optimization, determining the “best” join execution strategy between XML fragments returned as results of search engines. This task is also responsible for defining the join methods and the performance metrics according to different application scenarios.

T1 and T2 will be jointly conducted by the three operative units. The remaining tasks are performed primarily by one operative unit; this unit is responsible for T5. In the following, we describe T1, T2, and then (extensively) T5.

Task T1: Infrastructure design

All the content sources considered in this project are accessed by means of Web Services. It is reasonable to assume that the result of a Web Service call is an XML document whose structure is not only compliant with the WSDL of the invoked service, but also reflects some appropriate strategy for effectively “representing and publishing” the retrieved piece of information. However, a WSDL interface is purely syntactical and as such inappropriate for composition; this consideration motivates the research into the so-called “semantic Web”, aiming at enriching Web Services with ontological content so as to support arbitrary Web Service composition.

Thus, “registering” one such Web Service means, essentially, describing the conceptual properties representing the content that can be extracted from the service, and then describing the meaning of each “output element” in terms of its tags and in terms of the “typical semantics” of an output element produced by the service.
In particular, during the registration phase, Web Services are mapped to the NGS Global schema, represented by an Ontology, which we assume to be formulated in a W3C standard Ontology language, such as OWL or one of its less expressive variants. This is done by providing links from both the input and output specification of the Service to concepts of the Ontology.

In this research, we propose to define a generic scheme for registering Web Services that enables the storage of meta-data describing: (a) the Web Service syntax (request/response), (b) the semantics of tags in the request, (c) the semantics of tags in the response. Such meta-data are then linked to concepts of the Ontology describing the domain. The important aspect is that, as a result of such registration, it becomes possible to compare the “output elements” produced by two distinct services (or by subsequent invocations of the same service) by extending a simple equality testing to more complex reasoning tasks, which make, e.g., use of subsumption checking between concepts.

The “scalability” of our approach (i.e., ability to support multiple sources) depends on the ease of registration of a new service. Therefore, while in the first phase of the research the emphasis will be on providing effective manual linkage of a few services (e.g., Google, Amazon, DBLP, and some wrapped sources so as to support our test queries), we will next investigate the possibility of semi-automatic support for service registration by tools that rely on automated reasoning capabilities.
To deal with the necessity of adapting and extending the Ontology so as to accommodate new information needs, we will draw from the experience gained in Information Integration, where good scalability is achieved by expressing local schemas as queries over the global schema, when several Data Sources are expected to be added. Analogously, in the context of NGS, one can, e.g., express the output of a given Service as a query over concepts of the Ontology.

In this context, two different policies for adapting and extending the Ontology can be considered: (i) A Service Driven Extension, occurring when newly registered Web Services refer to or are better represented by concepts which are not yet represented in the Ontology; in this case the Ontology is to be augmented so as to represent the semantic knowledge carried by the Web Service. (ii) A Query Driven Extension, occurring when user queries require information that does not yet have a semantic counterpart in the Ontology; in this case we may consider to add concepts to the Ontology so as to match newly expressed user needs, and to subsequently map Services to the Ontology whose output satisfies the newly added concepts.

In this research, we propose to study such policies in the context of NGS, and specifically to investigate how to support them through automated reasoning.


Task T2: Search-time support

This task consists of querying one source and then storing its results for subsequent processing. The query is performed by invoking a suitable Web Service (request part) and then managing the response and retrieving the results according to an interface that enables the partial loading of the first N entries of the result, where N is a parameter established at calling time. In general, this task is trivial if the results are provided as plain XML records and if the Web Service interface enables to control the number of entries returned as result, as in the cases of Google or Amazon Web Services.

However, in general, Web Services may not have sophisticated control provisions and they may return unbound amounts of information. In such case, the task has responsibility for making good use of available resources.

Moreover, this task is also responsible for managing the answers which are provided in formats other than XML records, and of aligning them to the standard format used for later processing of the query, while at the same time keeping the reference from the aligned record to the result returned by the service, that is probably of interest to the user.

An example of this functionality is the transformation required for “reading” a map provided in a graphical format with XML annotations regarding points on the map, and then for managing the semantics of specific queries, such as requesting the extraction of locations which are “within a given distance” from a given point. In such case, while the query could even be expressed graphically on the map, the system must able to respond not in terms of a graphic subset of the map, but rather in terms of the items which fall inside that area and represent “locations” (e.g. city names or zip codes), so as to enable the composition of this result with other results. Moreover, “closeness” to a point has to be used for ranking the results before putting them in an XML format which is compatible with the other partial results.

This task is also responsible for capturing the interaction with the users in order to improve the iterative execution of searches. It is well known that the interaction with search engines is typically an iterative process, where users perform several iterations of the search by altering the choice of input terms based upon the results of the previous iteration, until they are satisfied; normally, this process converges to a “better” result. Well known techniques of information retrieval allow the user to further condition the search by indicating, in the results, the elements that are either highly relevant or irrelevant.

We believe that user input may be very useful to improve search strategies, and therefore we plan to spend the final period of the project in experimenting and testing various alternatives for user’s involvement. We plan to enable users to indicate which retrieved concepts better represent its intended meaning, and we plan to trace them back to the inputs being presented to given search services, so as to repeat such searches with improved input. The “tracing” can either be automatic or also be helped by users, by means of suitable interactions. In addition, interaction may be used to confirm conjectural matches (e.g., the matches of concepts such as “professor”, “researcher”, “author”, if not explicitly supported by the vocabulary).


Task T5: Search Optimization

Task T5 consists in the extension of classical query processing paradigms to the new task of “joining and matching” the partial results retrieved by means of Web Service calls. It is well known that “joins” enable the value-based matching of sub-query results; task T5 adapts well-known join schemes (e.g., nested loop, merge-scan, hash based) to a context where value-based identity is substituted by complex term matching. The join takes place among ranked lists of XML fragments, where each fragment represents one individual resource descriptor within the search result. Compatibility between two fragments is determined by matching given XML elements; these are known a-priori, as the XML schemas of Web Service responses are known; however, the matching of two fragments may be “stronger” or “weaker”, as it involves value and string comparison (e.g., identity, compatibility, and so on, possibly including the use of subsumption and term proximity). The matching itself thus yields to a third score, that is to be combined with the original ranking scores of the entries being matched, so as to produce the global score of the join result.

In such setting, building efficient optimization techniques amounts to building the most efficient techniques for providing the match results with highest ranking. In the simpler case of binary joins, the cost metrics for determining the join cost depend on the cardinalities of the results produced by each individual search and on the number of individual matching operations being required; effective optimization methods can be designed for given assumptions upon the query costs, e.g. by assuming that dominant costs are related to request-response executions or, instead, to ontology-based matching of result elements.

Given the proposed application – presenting results to a user who is interacting with a search engine platform – exhaustive search makes little sense, and instead “best matches” have to be searched efficiently, with mechanisms which are comparable to query processing techniques for the efficient determination of “top ranking” results. However, such “top rank” is deterministically defined in relational queries, while it is based on complex ranking schemes in our setting, and this of course makes the problem more difficult. Nevertheless, the results returned to users will be ordered according to the global score of the join operation, or at least in an approximation of this order. Indeed, suspending for a long time the visualization of “good” results only because a “better” one may show up afterwards is a strategy that is unlikely to satisfy the needs of the average user. Different degrees of approximation with respect to the order of extraction of the results (compared to the “theoretically optimal” order) may be studied and introduced.

The development of this task will consist initially in defining an abstract format for “query plans” and studying its formal properties – most likely, standard equivalence transformations which are well known in the context of join optimization, can be reused in this new setting. We will also define a variety of “match” operations; their effectiveness and cost will be studied as a function of the cost and complexity of accessing “ontological” resources and of using approximation and containment instead of resorting to simple equality.

Then, “join methods” will be defined, as a combination of (1) ordering the entries of the two operand results being inspected and (2) use of given match operation. Such a “logical” description will have to be complemented by a “physical” plan for storing intermediate and final results, and for “indexing” them according to the used vocabulary so as to enable the further use of join results within complex queries or in subsequent iterations of the query. The most promising join methods will be implemented and demonstrated.

Given join methods and query plans, we will continue the research in this task by focusing on optimization techniques, capable of (1) determining the join order where more than two sources are available (based upon simple heuristics), (2) selecting for each join the most promising join method, (3) arranging the dataflow and defining the data structures required for storage of intermediate and final results.

Each join method will be associated to a metric yielding its approximate cost and expected effectiveness, so that an optimizer will be able to choose the best method. Costs depend on two factors: the interaction with Web Services through the execution of request-response pairs, and the execution of several XML fragments comparisons after each new Web Service call. These recall the well-known join cost factors of join optimizers: i/o costs (now replaced by request/response pairs) and cpu costs (now replaced by the computational costs needed for joining XML fragments, when the matching requires textual comparison and/or domain-specific knowledge). In principle, the two costs adds up; however, we may easily foresee scenarios where the former or the latter cost factor dominates, and we will consider these extreme cases as most representative. The effectiveness of the overall strategy is also strictly related to several aspects which can hardly be taken into account, such as the dependability of the queried data sources, the estimated quality of the returned data, the average response time of the involved data providers. Some of these elements will be incrementally added to the supported metrics, based on their feasibility and relevance.

Data access plans produced by our method will be published in a clear format, and a designer-driven human intervention will be allowed in order to refine the access plans so as to better understand and tune the methods. As a follow-up of this research, we envision a system in which users may take advantage of proper wizards in order to provide valuable input for selection strategy. <br />
Workplan

The team will participate to the following deliverables:

D1.1. Technology-oriented state-of-the-art concerning Web Services, including the choice of the Web services to be used in the project and of their ontological domains (M3)
D1.2. Definition of the Web Service Registration platform architecture (M6)
D2.1. Definition of the protocols for web service invocation and for storing partial results (M6)
D5.1. Preliminary research results describing the abstract notation for defining joins among two sources and identification of the main join algorithms to be used in the project (M8)
D2.2. First running prototype supporting two sources, simple join methods, and no user interaction (M12)
D5.2. Design of advanced join methods and of complex query optimization (M18)
D2.3. Second running prototype supporting multiple sources, several join methods, and user interaction (M22)
D1.3. Experimentation and evaluation (M24)