ISTEX

Building the foundation of the national digital scientific library.



The aims

The ISTEX platform: the base of scientific resources

The purchase of documentary archives is accompanied by the creation of a platform to host them. This platform will house all of the acquired data, i.e. several million digital documents.

It should be remembered…

The main aim of the ISTEX project is to offer the entire research and higher education community online access to retrospective collections of scientific literature in all disciplines, through a national policy of large-scale document acquisition: journal archives, “databases”, text corpora, etc.

Initially, access to the documentary resources will be via the publishers’ platforms, but the ISTEX platform, managed by INIST-CNRS, will allow all of the data to be housed in a single reservoir in normalised formats.

Transparency

The ISTEX platform will be transparent for its users: it will sit upstream of distribution tools such as digital workspaces (ENT), the CNRS thematic portals, or discovery tools (for example Primo, Summon, EDS, etc.).

Access to the ISTEX resources will thus be via the same entry points as standard subscriptions and/or as open access resources.

ISTEX thus creates a common, unified, standard, and normalised reservoir of documentary objects (scientific articles, book chapters, encyclopaedia entries, etc.) that are accessible via multiple and varied channels (OAI-PMH harvesting, widgets, API, etc.).

The ISTEX platform

Numerous advantages

With several million digital documents from all disciplines, the ISTEX platform will offer varied benefits to the users:

Access to a unique and exceptional corpus:

This reservoir of data, unique of its kind, will be distinguished by three major characteristics:

  • it will be the first to group together such a large volume of multidisciplinary and multilingual documents (several million resources) with a single point of access and normalised formats;
  • it will contain whole resources (full text for each element) of diverse types (retrospective collections of journals; electronic books; major corpora of digitised heritage documents; databases, etc.);
  • these resources will be intended for multiple uses: for documentary research as much as for use as scientific material in research.

Systematic access to the full text of the document:

The ISTEX platform is not a descriptive database of metadata pointing to documents hosted by publishers, but rather a database containing full texts to allow different but complementary uses:

  • To no longer be dependent on external authorisations (e.g. the link towards a publisher’s portal) to access the full text of a document;
  • To be able to access documents with no time limitation;
  • To allow cross-cutting processing (automatic indexing, categorisation, knowledge extraction) of all or part of the base;
  • To be able to extract sub-corpora from the whole base according to criteria such as discipline, type of document, date, etc.
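
As an illustration of the last point, sub-corpus extraction by criteria can be sketched as a simple filter. This is only a minimal sketch: the `Document` record, its field names, and the toy data are assumptions made for the example and do not describe the actual ISTEX data model.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Document:
    # Hypothetical minimal record; real platform metadata may differ.
    identifier: str
    discipline: str
    doc_type: str
    year: int


def extract_subcorpus(
    documents: Iterable[Document],
    discipline: Optional[str] = None,
    doc_type: Optional[str] = None,
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
) -> List[Document]:
    """Return the documents matching every criterion that is not None."""
    selected = []
    for doc in documents:
        if discipline is not None and doc.discipline != discipline:
            continue
        if doc_type is not None and doc.doc_type != doc_type:
            continue
        if year_from is not None and doc.year < year_from:
            continue
        if year_to is not None and doc.year > year_to:
            continue
        selected.append(doc)
    return selected


# Toy data, invented purely for illustration.
CORPUS = [
    Document("doc-1", "chemistry", "article", 1954),
    Document("doc-2", "chemistry", "book-chapter", 1987),
    Document("doc-3", "history", "article", 1921),
    Document("doc-4", "chemistry", "article", 1999),
]

if __name__ == "__main__":
    # Chemistry articles published up to 1990.
    for doc in extract_subcorpus(
        CORPUS, discipline="chemistry", doc_type="article", year_to=1990
    ):
        print(doc.identifier)
```

On a real platform the same criteria would more plausibly be sent as a query to the search engine than applied in memory; the filter above only shows the logic of combining criteria.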

A powerful search engine adapted to the needs of scientists, with simple querying and downloading:

Given the large volume of data and the demanding level of documentary research, the search engine must be powerful and robust but also scalable and open.

Furthermore, the multilingualism of the documents requires complex and varied natural language processing.

The choice fell on an open source search engine (ElasticSearch), which makes it possible to benefit from the tools developed by the engine's user community. It will thus be easy to integrate functions such as lemmatisation[1] (handling of inflected forms), intelligent handling of stop words, and the addition of synonyms in the queries or in the facets.
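
To make these functions more concrete, the sketch below builds index settings for a custom text analyser that chains a stop-word filter, a synonym filter, and a stemmer. It is an assumption-laden illustration, not the platform's actual configuration: the analyser and filter names (`fulltext_fr`, `french_stop`, `custom_synonyms`, `french_stemmer`) and the synonym pairs are invented for the example, French is chosen arbitrarily among the corpus languages, and stemming is used here as a rough stand-in for true lemmatisation.

```python
import json

# Hypothetical synonym groups, invented for illustration.
SYNONYMS = ["voiture, automobile", "ordinateur, calculateur"]


def build_index_settings(synonyms):
    """Build an Elasticsearch-style settings body for a French full-text analyser."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Removes common French stop words ("le", "de", "et", ...).
                    "french_stop": {"type": "stop", "stopwords": "_french_"},
                    # Reduces inflected forms to a common stem.
                    "french_stemmer": {"type": "stemmer", "language": "light_french"},
                    # Expands each token with its declared synonyms.
                    "custom_synonyms": {"type": "synonym", "synonyms": list(synonyms)},
                },
                "analyzer": {
                    "fulltext_fr": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": [
                            "lowercase",
                            "french_stop",
                            "custom_synonyms",
                            "french_stemmer",
                        ],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # The body would be sent when creating an index; here it is only printed.
    print(json.dumps(build_index_settings(SYNONYMS), indent=2, ensure_ascii=False))
```

The order of the filters matters: lowercasing comes first so that stop words and synonyms match regardless of case, and stemming comes last so that synonyms are declared on full word forms.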

Data processing services: data extraction, text mining, production of document summaries and terminological corpora:

This immense reserve of textual data could serve as scientific material for applied research in different domains such as natural language processing (TAL in French), but also for the history of science, the production of indicators, etc.

This reserve is fully integrated in the national landscape and allows two-way exchanges with the other projects that fall within the ISTEX perimeter, whether for resource enrichment or curation, or for their exploitation, etc.

Integration in the local digital environment, allowing easy navigation between current resources and retrospective collections:

This reservoir of retrospective data is connected to current resources and to contemporary distribution systems (digital workspaces, portals, discovery tools). The platform is thus a base that will interface easily with existing portals, for example via API or widgets that can rapidly be plugged into the Content Management Systems (CMS) used by the distributors of electronic resources (SCD, CNRS, EPST, etc.). An effort will be made to ensure that ISTEX metadata are deposited in commercial tools (discovery tools[2], link resolvers[3]) to guarantee continuity of search and access between current subscriptions and archives.

Remote access for all members of higher education and research establishments:

Access management will be handled by the distribution tools of organisations belonging to the ESR (the higher education and research sector), according to the means of their choice and the technologies that they set up. Access to the ISTEX resources between the distribution tools (portal, ENT, etc.) and the ISTEX base operated at INIST-CNRS will initially be controlled by IP address and later by authentication. A demonstrator offering a query interface will be available for organisations that do not have their own means of distribution, and this will serve as the default portal.
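
Control by IP address amounts to checking whether a client address belongs to one of the network ranges declared by member organisations. The following is a minimal sketch of that check using Python's standard `ipaddress` module; the ranges shown are reserved documentation ranges chosen for the example, not real institutional networks, and the function name is hypothetical.

```python
import ipaddress

# Hypothetical allowed ranges (reserved documentation networks, not real ones).
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_authorised(client_ip, networks=ALLOWED_NETWORKS):
    """Return True if client_ip belongs to at least one allowed network."""
    address = ipaddress.ip_address(client_ip)
    # An IPv4 address is never contained in an IPv6 network, and vice versa.
    return any(address in network for network in networks)


if __name__ == "__main__":
    for ip in ("192.0.2.17", "198.51.100.5", "2001:db8::1"):
        print(ip, is_authorised(ip))
```

A check of this kind identifies a network, not a person, which is one reason a later move to authentication would be needed for remote users outside their institution's network.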

Long-term access to the acquired retrospective data:

A long-term archiving programme will allow the data to be preserved for several decades. This part of the project will be undertaken by a public organisation specialising in the field (European national libraries).

Benefits for Higher Education and Research

This multidisciplinary platform is intended for researchers, academics, and other people involved in research and higher education. It will meet the requirements of users with several profiles:

  • IT specialists seeking to query the platform's APIs[4] (REST[5], OAI-PMH[6], SPARQL[7], etc.) in order to extract corpora, e.g. for a research project;
  • Webmasters wishing to integrate the platform in their organisation’s ENT with the help of widgets associated with the platform (easy to install and connected directly to the platform via the web);
  • Members of the ESR who already have discovery and link-resolution tools, and who wish to access the resources through this software;
  • Members of the ESR (researchers, documentalists, etc.) who wish to consult the platform’s resources through the demonstrator offered on the official ISTEX web site.
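
For the first of these profiles, a metadata harvest over OAI-PMH starts with a `ListRecords` request. The sketch below only builds the request URL; the base URL is a placeholder, not the platform's real endpoint, and the function name is hypothetical. The `verb`, `metadataPrefix`, `set`, `from` and `until` parameters are those defined by the OAI-PMH protocol itself.

```python
from urllib.parse import urlencode

# Placeholder endpoint, assumed for the example.
BASE_URL = "https://oai.example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                     from_date=None, until_date=None):
    """Build an OAI-PMH ListRecords request URL with optional selective harvesting."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    print(list_records_url(BASE_URL, from_date="2014-01-01", until_date="2014-12-31"))
```

Note that in OAI-PMH the `from` and `until` arguments filter on the record datestamp (when the record was created or modified in the repository), not on the publication date of the document.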

 

[1] The process of grouping all inflected forms of a word under its canonical form (e.g. the French adjective “petit” exists in four forms: “petit”, “petite”, “petits” and “petites”)

[2] Metasearch engines allowing a cross-cutting search for information across several reservoirs

[3] Document tools that link a source (generally metadata) to a target (generally an online full text)

[4] API: The programming interface that allows a specific functionality to be accessed via the network

[5] REST: an architectural style commonly used on the web

[6] OAI-PMH: a protocol for harvesting and exchanging metadata

[7] SPARQL: a query language for accessing data on the web

investissement d'avenir

Funding: ANR-10-IDEX-0004-02