RESEARCH UNIT
Research program
New technologies and tools for the integration of Web search services
University Co-ordinator
Università degli Studi ROMA TRE - INFORMATICA E AUTOMAZIONE
Research Unit Leader
Riccardo Torlone
Description
The proposed project is devoted to building the infrastructure for NGS services (Next Generation Search services) and is organized in five tasks:
T1: Infrastructure design, focused on the design of an infrastructure for registering Web Services and Web Wrappers. Resources are registered together with their local schema, and their input/output message tags are mapped onto the concepts of a Global Ontology.
T2: Search-time support, supporting the execution of users' queries by providing facilities for query submission and refinement, support for the join execution strategy, and result materialization.
T3: Wrapper development, enabling the searching and extraction of information from Web data with a Web Service interface that imitates the interface of "conventional" search engines.
T4: Query reformulation, determining the set of services relevant to a user query and the conditions for their pairwise join. Reformulation considers the constraints expressed by the Global Ontology, the local schemas of the services, and the mappings between the Global and the local schemas.
T5: Search optimization, determining the "best" join execution strategy between XML fragments returned as results of search engines. This task is also responsible for defining the join methods and the performance metrics according to different application scenarios.
T1 and T2 will be jointly conducted by the three operative units. Each of the remaining tasks is performed primarily by one operative unit; this unit is responsible for T3. In the following, we describe T1, T2, and then (more extensively) T3.
Task T1: Infrastructure design
All the content sources considered in this project are accessed by means of Web Services. It is reasonable to assume that the result of a Web Service call is an XML document whose structure is not only compliant with the WSDL of the invoked service, but also reflects some appropriate strategy for effectively "representing and publishing" the retrieved piece of information. However, a WSDL interface is purely syntactical and as such inappropriate for composition; this consideration motivates the research into the so-called "semantic Web", aiming at enriching Web Services with ontological content so as to support arbitrary Web Service composition.
Thus, "registering" one such Web Service means, essentially, describing the conceptual properties representing the content that can be extracted from the service, and then describing the meaning of each "output element" in terms of its tags and in terms of the "typical semantics" of an output element produced by the service. In particular, during the registration phase, Web Services are mapped to the NGS Global schema, represented by an Ontology, which we assume to be formulated in a W3C standard Ontology language, such as OWL or one of its less expressive variants. This is done by providing links from both the input and output specification of the Service to concepts of the Ontology.
In this research, we propose to define a generic scheme for registering Web Services that enables the storage of meta-data describing: (a) the Web Service syntax (request/response), (b) the semantics of tags in the request, (c) the semantics of tags in the response. Such meta-data are then linked to concepts of the Ontology describing the domain. The important aspect is that, as a result of such registration, it becomes possible to compare the "output elements" produced by two distinct services (or by subsequent invocations of the same service) by extending simple equality testing to more complex reasoning tasks, which make use of, e.g., subsumption checking between concepts.
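As an illustration only, the registration meta-data and the subsumption-based comparison can be sketched as follows. The concept names, the parent table, the two services and their tags are all hypothetical, and a hand-written parent map stands in for an OWL ontology and a reasoner.

```python
from dataclasses import dataclass, field

# Hypothetical toy ontology: each concept maps to its direct parent (None = root).
PARENT = {
    "Person": None,
    "Author": "Person",
    "Professor": "Person",
    "Publication": None,
    "Book": "Publication",
}


def subsumes(general: str, specific: str) -> bool:
    """Return True if `general` is `specific` itself or one of its ancestors."""
    current = specific
    while current is not None:
        if current == general:
            return True
        current = PARENT.get(current)
    return False


@dataclass
class ServiceRegistration:
    """Meta-data stored for one registered service (illustrative fields only)."""
    name: str
    request_tags: dict = field(default_factory=dict)   # request tag -> concept
    response_tags: dict = field(default_factory=dict)  # response tag -> concept


def comparable(reg_a, tag_a, reg_b, tag_b) -> bool:
    """Two output elements are comparable if one concept subsumes the other."""
    concept_a = reg_a.response_tags[tag_a]
    concept_b = reg_b.response_tags[tag_b]
    return subsumes(concept_a, concept_b) or subsumes(concept_b, concept_a)


# Two made-up services registered against the toy ontology.
books = ServiceRegistration(
    "book_search", {"q": "Book"}, {"writer": "Author", "title": "Book"}
)
staff = ServiceRegistration(
    "staff_directory", {"name": "Person"}, {"member": "Person"}
)

print(comparable(books, "writer", staff, "member"))  # Author vs. Person
print(comparable(books, "title", staff, "member"))   # Book vs. Person
```

Plain equality of tag names would never relate "writer" to "member"; the comparison succeeds here only because both tags were linked to concepts during registration.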
The "scalability" of our approach (i.e., ability to support multiple sources) depends on the ease of registration of a new service. Therefore, while in the first phase of the research the emphasis will be on providing effective manual linkage of a few services (e.g., Google, Amazon, DBLP, and some wrapped sources so as to support our test queries), we will next investigate the possibility of semi-automatic support for service registration by tools that rely on automated reasoning capabilities. To deal with the necessity of adapting and extending the Ontology so as to accommodate new information needs, we will draw from the experience gained in Information Integration, where good scalability is achieved by expressing local schemas as queries over the global schema, when several Data Sources are expected to be added. Analogously, in the context of NGS, one can, e.g., express the output of a given Service as a query over concepts of the Ontology.
In this context, two different policies for adapting and extending the Ontology can be considered: (i) A Service Driven Extension, occurring when newly registered Web Services refer to or are better represented by concepts which are not yet represented in the Ontology; in this case the Ontology is to be augmented so as to represent the semantic knowledge carried by the Web Service. (ii) A Query Driven Extension, occurring when user queries require information that does not yet have a semantic counterpart in the Ontology; in this case we may consider adding concepts to the Ontology so as to match newly expressed user needs, and subsequently mapping to the Ontology those Services whose output satisfies the newly added concepts.
In this research, we propose to study such policies in the context of NGS, and specifically to investigate how to support them through automated reasoning.
Task T2: Search-time support
This task consists of querying one source and then storing its results for subsequent processing. The query is performed by invoking a suitable Web Service (request part) and then managing the response and retrieving the results according to an interface that enables the partial loading of the first N entries of the result, where N is a parameter established at calling time. In general, this task is trivial if the results are provided as plain XML records and if the Web Service interface makes it possible to control the number of entries returned as result, as in the cases of Google or Amazon Web Services.
However, in general, Web Services may not have sophisticated control provisions and they may return unbounded amounts of information. In such cases, the task is responsible for making good use of available resources.
Moreover, this task is also responsible for managing the answers which are provided in formats other than XML records, and for aligning them to the standard format used for later processing of the query, while at the same time keeping the reference from the aligned record to the result returned by the service, which is probably of interest to the user.
An example of this functionality is the transformation required for "reading" a map provided in a graphical format with XML annotations regarding points on the map, and then for managing the semantics of specific queries, such as requesting the extraction of locations which are "within a given distance" from a given point. In such a case, while the query could even be expressed graphically on the map, the system must be able to respond not in terms of a graphic subset of the map, but rather in terms of the items which fall inside that area and represent "locations" (e.g. city names or zip codes), so as to enable the composition of this result with other results. Moreover, "closeness" to a point has to be used for ranking the results before putting them in an XML format which is compatible with the other partial results.
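A minimal sketch of partial loading, assuming the response can be consumed lazily as a stream of entries. The generator below is a made-up stand-in for a service that never stops producing results; no real Web Service API is implied.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def unbounded_service() -> Iterator[dict]:
    """Hypothetical stand-in for a service that returns entries without limit."""
    rank = 0
    while True:
        rank += 1
        yield {"rank": rank, "title": f"result {rank}"}


def first_n(results: Iterable[dict], n: int) -> List[dict]:
    """Materialize only the first n entries and leave the rest unread."""
    return list(islice(results, n))


# N is chosen at calling time; only three entries are ever produced.
top = first_n(unbounded_service(), 3)
print([entry["title"] for entry in top])
```

Because the stream is read lazily, the cost of the call depends on N rather than on how much the service is willing to return.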
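The map example can be sketched as follows, under several assumptions that are not part of the project: annotated points are available as name/coordinate pairs, distance is great-circle (haversine) distance, and the element names of the output XML are invented. The coordinates are approximate and serve only as sample data.

```python
import math
import xml.etree.ElementTree as ET


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def locations_within(center, locations, max_km):
    """Keep locations within max_km of center, ranked by closeness."""
    hits = []
    for name, (lat, lon) in locations.items():
        distance = haversine_km(center[0], center[1], lat, lon)
        if distance <= max_km:
            hits.append((name, distance))
    hits.sort(key=lambda hit: hit[1])  # closest first
    return hits


def to_xml(hits):
    """Serialize ranked hits in a made-up XML format for later composition."""
    root = ET.Element("locations")
    for rank, (name, _distance) in enumerate(hits, start=1):
        item = ET.SubElement(root, "location", rank=str(rank))
        item.text = name
    return ET.tostring(root, encoding="unicode")


# Approximate coordinates, for illustration only.
annotated_points = {
    "Tivoli": (41.96, 12.80),
    "Frascati": (41.81, 12.68),
    "Milan": (45.46, 9.19),
}
rome = (41.90, 12.50)

ranked = locations_within(rome, annotated_points, max_km=50)
print(to_xml(ranked))
```

The answer is a ranked list of named items rather than a region of the picture, which is what allows it to be joined with results from other services.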
This task is also responsible for capturing the interaction with the users in order to improve the iterative execution of searches. It is well known that the interaction with search engines is typically an iterative process, where users perform several iterations of the search by altering the choice of input terms based upon the results of the previous iteration, until they are satisfied; normally, this process converges to a "better" result. Well known techniques of information retrieval allow the user to further condition the search by indicating, in the results, the elements that are either highly relevant or irrelevant.
We believe that user input may be very useful to improve search strategies, and therefore we plan to spend the final period of the project experimenting with and testing various alternatives for user involvement. We plan to enable users to indicate which retrieved concepts better represent their intended meaning, and we plan to trace them back to the inputs being presented to given search services, so as to repeat such searches with improved input. The "tracing" can either be automatic or be helped by users, by means of suitable interactions. In addition, interaction may be used to confirm conjectural matches (e.g., the matches of concepts such as "professor", "researcher", "author", if not explicitly supported by the vocabulary).
Task T3: Wrapper Development
The specific goal of our research unit is the study of techniques for developing Web Services specialized in the extraction of contents from data-intensive Web sites (e.g., wrappers of sites exposing bond quotes or the personnel of a given research institute). As discussed in Section 2.4, our starting point is the set of results obtained with Roadrunner, our proposal for the automatic generation of Web wrappers. In this project we aim at overcoming the current limitations of the Roadrunner approach. It has been proved that Match, the core algorithm of Roadrunner, can produce exactly one solution in polynomial time for a specific class of languages, called Prefix Mark-up Languages [CrMe04]. Unfortunately, a large number of real-life pages do not fall in this class: for these pages the algorithm can fail or produce low-quality results, which limits the scalability of the approach. The causes that bring real-life Web pages out of the class of Prefix Mark-up Languages are manifold. First, some pages, though fairly regular in regions that contain data of interest, may exhibit irregularities in other portions of the HTML code; typical examples are pages containing banners, advertisements or chunks of free text. Second, since HTML is mainly used to define the visual presentation and organization of Web pages, it is deeply ambiguous: for example, the same HTML tags are repeatedly used to mark completely different information. Even in the most regular portions of HTML pages, this ambiguity brings the source HTML code out of the class of Prefix Mark-up Languages. Finally, several real-life Web pages contain disjunctive patterns, which are not included in the class of Prefix Mark-up Languages.
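To make the idea of inferring a wrapper from sample pages concrete, the following sketch covers only the simplest situation: two pages with identical tag structure that differ in their text. It is not the Match algorithm. It assumes a naive regular-expression tokenizer and a position-by-position alignment, and it gives up on any structural difference, which is exactly where optional, repeated or disjunctive patterns would have to be handled. The sample pages are invented.

```python
import re

# Naive tokenizer (an assumption of this sketch): a token is a tag or a text run.
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Split HTML into tag tokens and non-empty, stripped text tokens."""
    return [tok.strip() for tok in TOKEN.findall(html) if tok.strip()]


def generalize(page_a, page_b):
    """Align two pages token by token; differing text becomes a #PCDATA field."""
    tokens_a, tokens_b = tokenize(page_a), tokenize(page_b)
    if len(tokens_a) != len(tokens_b):
        raise ValueError("structural mismatch: not handled by this sketch")
    template = []
    for tok_a, tok_b in zip(tokens_a, tokens_b):
        if tok_a == tok_b:
            template.append(tok_a)        # shared token: part of the template
        elif not tok_a.startswith("<") and not tok_b.startswith("<"):
            template.append("#PCDATA")    # differing text: a data field
        else:
            raise ValueError("tag mismatch: not handled by this sketch")
    return template


page_1 = "<html><b>Name:</b> Ann Smith <i>Rome</i></html>"
page_2 = "<html><b>Name:</b> Bob Jones <i>Pisa</i></html>"

print(generalize(page_1, page_2))
```

A banner present in only one of the two pages would already make this alignment fail, which mirrors the first cause of irregularity listed above.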
To overcome these issues, we believe that it is possible to complement our techniques with others inspired by those developed by Arasu and Garcia-Molina in the ExAlg approach. As discussed in Section 2.4, ExAlg is an algorithm that exploits statistical features of a large collection of sample pages in order to infer the wrapper. ExAlg also exhibits limitations that can lead it to produce low-quality solutions when dealing with real-life Web pages. Namely, ExAlg can produce wrappers that extract some attributes only "partially"; in short, many attributes are extracted by ExAlg within unstructured chunks of HTML text. Here the causes are related to the reliability of the statistics on which the approach is based.
We have observed that Roadrunner and ExAlg are complementary: the former relies on local features, while the latter leverages global properties. Our goal is to reconcile the two approaches in order to overcome the issues that limit the practical applicability of each approach alone.
In the project we aim at studying solutions to the challenging problem of automatically generating wrappers for real-life Web pages, overcoming the limitations of current approaches. The idea is to extend and support Roadrunner-style algorithms with techniques, inspired by the ExAlg approach, that dig out statistical information. The system we aim at developing should be able to (i) deal with pages containing local irregularities, (ii) resolve the ambiguity of the HTML encoding, (iii) enhance the expressivity of the approach so that disjunctive patterns can also be inferred.
We can sketch our approach as follows. First we will use global (statistical) features to produce coarse-grained segments of the input pages; techniques to perform this task will be inspired by the approach developed in ExAlg. The produced segments will then be clustered in order to detect disjunctive patterns of irregular regions. Finally, for each segment a wrapper will be recursively inferred, adopting the techniques developed in Roadrunner.
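One kind of global feature that could drive the first step is how often each tag occurs in each sample page. The sketch below groups tags that share the same occurrence vector across pages, loosely in the spirit of ExAlg; it is a simplification under stated assumptions, not the ExAlg algorithm and not the technique the project will necessarily adopt. The tag pattern and sample pages are invented.

```python
import re
from collections import Counter, defaultdict

TAG = re.compile(r"<[^>]+>")  # naive tag pattern, an assumption of this sketch


def occurrence_vectors(pages):
    """Map each tag to a tuple holding its number of occurrences in each page."""
    counts = [Counter(TAG.findall(page)) for page in pages]
    tags = set().union(*counts)
    return {tag: tuple(count[tag] for count in counts) for tag in tags}


def equivalence_classes(pages):
    """Group tags that share exactly the same occurrence vector."""
    groups = defaultdict(list)
    for tag, vector in occurrence_vectors(pages).items():
        groups[vector].append(tag)
    return {vector: sorted(tags) for vector, tags in groups.items()}


sample_pages = [
    "<ul><li>a</li><li>b</li></ul><p>ad</p>",
    "<ul><li>c</li></ul>",
]

for vector, tags in sorted(equivalence_classes(sample_pages).items()):
    print(vector, tags)
```

In this toy input, tags with vector (1, 1) behave like a fixed frame, tags with vector (2, 1) like a repeated region, and tags with vector (1, 0) like an irregular region present in only one page. Coarse segments delimited in this way could then be handed to a local, alignment-based inference step.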
Workplan
The team will participate in the following deliverables:
Month 3: State-of-the-art
D11. Technology-oriented state-of-the-art concerning Web Services, including the choice of the Web services to be used in the project and of their ontological domains.
Month 6: Architecture Design
D12. Definition of the Web Service Registration platform architecture.
D21. Definition of the protocols for web service invocation and for storing partial results.
Month 8: Preliminary research results
D31. Definition of the wrapping interfaces and design of the first wrappers for selected data sources.
D32. Preliminary research results describing techniques for automatically inferring wrappers on real-life Web data sources.
Month 12: Architecture Integration, First Phase
D22. First running prototype supporting two sources, simple join methods, and no user interaction.
Month 18: Production of Independent Research Results
D33. Design and development of advanced algorithms for the automatic inference of wrappers for Web data.
D23. Second running prototype supporting multiple sources, several join methods, and user interaction.
Month 24: Evaluation
D13. Experimentation and evaluation.



