Text and Data Mining

Text and Data Mining resources from the ASU Library

Finding Data

Below is an alphabetical list of potential sources for datasets.  This list is not comprehensive. Most of the resources in this list are library-licensed content available only to ASU researchers.

We have also provided links to several search tools you can use to locate additional data sources.

American Physical Society

American Physical Society makes data sets available for research under specific restrictions and upon request.  Researchers must accept terms and conditions governing the use of the data sets.  For more information, including terms and conditions and instructions for requesting data sets, see APS Data Sets for Research.

Datasets contain only table-of-contents-level metadata and article/citing-article pairs.

Annual Reviews

Text and Data Mining is permitted for authorized users for legitimate academic research and other educational purposes.  A request must be made to Annual Reviews to gain access.

To start the request process, please contact ASU Library Researcher Support.

Cambridge

Text and data mining is permitted by authorized users for non-commercial research purposes only. Specific requirements and restrictions can be found in the Cambridge Terms of Use.

Early English Books Online (EEBO)

The Early English Books Online Text Creation Partnership (EEBO-TCP) is an ongoing project that is creating textual transcriptions of the EEBO digitized images.  The work is being done in two phases, each with its own availability and access instructions.

EEBO-TCP Phase I

More than 25,000 texts are freely available for anyone to use without restriction.  The texts may be used for text mining, they can be modified, and they can be shared with others. The raw files can be downloaded as gzipped tarballs through this Box.com folder: EEBO Phase 1 bulk files.
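As a minimal sketch of working with one of the gzipped tarballs once it has been downloaded, the Python snippet below reads the archive in place and yields each XML file it contains. It assumes the transcriptions are stored as `.xml` files inside the archive; the archive name in the usage note is hypothetical, so substitute the file you actually downloaded.

```python
import tarfile


def iter_xml_members(tar_path):
    """Yield (member_name, raw_bytes) for each .xml file in a gzipped tarball."""
    with tarfile.open(tar_path, mode="r:gz") as archive:
        for member in archive:
            # Skip directories, links, and non-XML files.
            if member.isfile() and member.name.lower().endswith(".xml"):
                handle = archive.extractfile(member)
                if handle is not None:
                    yield member.name, handle.read()
```

For example, `for name, raw in iter_xml_members("eebo_phase1_sample.tar.gz"): ...` walks the texts one at a time without unpacking the whole archive to disk.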

EEBO-TCP Phase II

Phase II is ongoing, with just under 35,000 titles released by 2019.  Additional titles will be released later in the year.  Arizona State University is a partner library, which means ASU researchers may use these texts subject to the restrictions in the ASU Library license.  The files may be downloaded for local use, but may not be shared or redistributed to users at non-TCP partner institutions without permission. To request access to download these files, send an email to tcp-info@umich.edu.  Once you have been verified as an authorized ASU researcher, you will receive instructions for accessing the files.

The restrictions on Phase II files will end on or about January 1, 2021. Once restrictions have been removed, the Phase II texts will be freely available for anyone to use without restriction.

Other TCP Digital Collections

Eighteenth Century Collections Online (ECCO-TCP) includes the full text of about 3,000 books available freely to anyone. Evans Early American Imprints (Evans-TCP) includes the full text of about 5,000 books available freely to anyone. Like the EEBO-TCP Phase I files, the ECCO-TCP and Evans-TCP files are available for bulk download through Box.com folders.

See The Text Creation Partnership for more information about these collections.

HathiTrust

The HathiTrust Research Center (HTRC) facilitates non-profit research and educational uses of the HathiTrust Digital Library corpus.  The focus is on non-consumptive research, remaining within the bounds of fair use under U.S. Copyright Law. HTRC, a joint effort of Indiana University and the University of Illinois, has developed a suite of tools and services for TDM.  These include web-based algorithms, freely accessible datasets, and secure computing capsules.

Many of the tools and services require researchers to create an account on HTRC Analytics.  These accounts are limited to users at not-for-profit and educational institutions who have research or teaching purposes.

IOP

ASU researchers are permitted to conduct TDM using IOP Publishing journals that ASU Library currently subscribes to. TDM using subscribed journals is permitted for non-commercial purposes only. TDM using content published under a Creative Commons license is governed by the terms of the license for the particular article.

Researchers who wish to conduct TDM must email the following information to textmining@iop.org:

  • their full name;
  • the name of the institution to which they are affiliated;
  • the titles of the journals that they wish to mine and, where relevant, the years/issues;
  • the approximate length of time that they need to have access to the IOPscience website to carry out the T&DM; and
  • the purpose for which they want to conduct T&DM.

Before submitting this email, please read through the IOP Text and Data Mining (T&DM) Policy for details.

New York Philharmonic Open Data

The New York Philharmonic has made the following Open Data collections available for research.  For more information and to access datasets, please see Open Data at the New York Philharmonic.

Performance History

The Performance History database from the New York Philharmonic documents all known concerts of the New York Philharmonic, the New York Symphony, and the New/National Symphony from December 7, 1842.  This accounts for more than 20,000 performances. The metadata has been released under the Creative Commons Public Domain CC0 license and can be located on the New York Philharmonic's GitHub page.
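A first pass over the Performance History metadata might look like the Python sketch below, which counts programs per season. It assumes the data has been obtained as JSON with a top-level `programs` list whose entries carry a `season` field; those field names and the sample records are illustrative assumptions, so check them against the files in the GitHub repository before relying on them.

```python
import json
from collections import Counter


def programs_per_season(data):
    """Count programs per season in a parsed Performance History document."""
    # "programs" and "season" are assumed field names (see the note above).
    return Counter(
        program.get("season", "unknown") for program in data.get("programs", [])
    )


# Made-up sample in the assumed layout, not real Philharmonic data.
sample = json.loads("""
{"programs": [
  {"season": "1842-43", "orchestra": "New York Philharmonic"},
  {"season": "1842-43", "orchestra": "New York Philharmonic"},
  {"season": "1843-44", "orchestra": "New York Philharmonic"}
]}
""")

counts = programs_per_season(sample)
```

With a real file, replace the inline sample with `json.load(open(path, encoding="utf-8"))`.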

Subscriber's Project and Subscribers Database

The Subscriber's Project database contains the names, addresses, and seat locations for Philharmonic subscribers dating back to the 19th century and contains more than 500,000 subscriber records. (To protect privacy, post-1953 subscriber names are not searchable.)

This data is available in two ways: as a set of downloadable CSV files or in a searchable subset of subscribers between 1883 and 1907.

ProQuest

ProQuest allows data mining from historical and primary source content that the ASU Library has acquired with perpetual access. 

For data mining access, data sets must also be purchased from ProQuest. Due to a number of factors, the current turnaround time for data set production is approximately six months. The cost for data sets varies, but ranges from approximately $250-$500 per set. An invoice will not be generated until the data set has been prepared and is ready to deliver. Data sets can be delivered in one of three ways:

  • AWS (Amazon Web Services)
  • Digital File Transfer (from one FTP site to the institution's dedicated FTP site)
  • Hard disk drive (physical hard drive delivery)

To learn what data is available and start the request process, contact ASU Library Researcher Support.

SAGE Journals

Text and data mining is permitted by authorized users as long as the use is non-commercial and users only download articles to which they have legitimate access. Users are required to agree to the terms of SAGE's Standard Text and Data Mining License.

SAGE places limits on the number of articles that can be downloaded within a specific time frame and recommends the use of an API.  More information is available at Text and Data Mining on Sage Journals.

ScienceDirect (Elsevier)

Researchers from subscribing institutions are permitted to text mine for non-commercial research purposes using Elsevier's application programming interface (API). To obtain a personal API key, researchers self-register through the developer portal. Full text content is available in XML format.
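The Python sketch below shows roughly how a full-text XML request could be prepared once a personal API key has been issued. The endpoint URL and the `X-ELS-APIKey` header name are assumptions drawn from Elsevier's public developer documentation and should be confirmed there; the DOI and key in the usage note are placeholders. The function only builds the request and does not send it.

```python
import urllib.parse
import urllib.request

# Assumed article-retrieval endpoint; confirm against Elsevier's developer docs.
ARTICLE_ENDPOINT = "https://api.elsevier.com/content/article/doi/"


def build_fulltext_request(doi, api_key):
    """Prepare (but do not send) a request for an article's full text as XML."""
    url = ARTICLE_ENDPOINT + urllib.parse.quote(doi, safe="/")
    headers = {
        "X-ELS-APIKey": api_key,  # personal key from the developer portal
        "Accept": "text/xml",     # ask for the XML representation
    }
    return urllib.request.Request(url, headers=headers)
```

A request built with `build_fulltext_request("10.1016/j.example.2020.01.001", "YOUR-KEY")` can then be passed to `urllib.request.urlopen` from a network that is authorized for ASU's subscription.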

Elsevier's Object Retrieval API can be used by researchers to text mine images and other objects associated with an article.

Open Access content is available to mine via the Full Text API, following the requirements of each article's user license.

For more information please see Elsevier's Text and Data Mining Policy.

Taylor & Francis

Taylor & Francis recently updated their Terms and Conditions to permit TDM on ASU-subscribed content as well as open access content, with no additional charge, provided the TDM is non-commercial in nature. Researchers are required to adhere to the terms and conditions outlined in the Taylor & Francis STM Model License.

Before starting TDM activity, please contact Taylor & Francis at support@tandfonline.com with your ASU affiliation and a brief overview of your planned project. This is to ensure that they can provide any needed access and support.

Wiley

To support platform stability and user security, Wiley asks that users only access content for TDM purposes using an approved API service.  To receive an API key, users must agree to Wiley's TDM license. More information and a link to the TDM license can be found in Wiley's Text and Data Mining Policy.

World Scientific Publishing

Text and Data Mining is permitted; however, World Scientific Publishing does not provide a specific API nor do they make datasets available.  Bot prevention technology is in place and access to the platform will automatically be cut off following 200 downloads in a short time.  

Please contact ASU Library Researcher Support before attempting to conduct TDM on the World Scientific Publishing platform so we can coordinate access on your behalf.
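Once access has been coordinated, a script can pace itself to stay under a download cap like the one described above. The Python sketch below is a generic sliding-window throttle, not something provided by World Scientific Publishing. The length of the "short time" window is not stated, so the window value in the usage note is a placeholder to be replaced with whatever limit is agreed.

```python
import time
from collections import deque


class DownloadThrottle:
    """Allow at most `max_downloads` within any `window_seconds` span."""

    def __init__(self, max_downloads, window_seconds,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_downloads = max_downloads
        self.window_seconds = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._stamps = deque()  # times of recent downloads, oldest first

    def wait_for_slot(self):
        """Block until another download is allowed, then record it."""
        while True:
            now = self._clock()
            # Forget downloads that have aged out of the window.
            while self._stamps and now - self._stamps[0] >= self.window_seconds:
                self._stamps.popleft()
            if len(self._stamps) < self.max_downloads:
                self._stamps.append(now)
                return
            # Sleep until the oldest recorded download leaves the window.
            self._sleep(self.window_seconds - (now - self._stamps[0]))
```

For example, create `throttle = DownloadThrottle(max_downloads=150, window_seconds=3600)` (both values are placeholders) and call `throttle.wait_for_slot()` before each download.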

The ASU Library acknowledges the twenty-three Native Nations that have inhabited this land for centuries. Arizona State University's four campuses are located in the Salt River Valley on ancestral territories of Indigenous peoples, including the Akimel O’odham (Pima) and Pee Posh (Maricopa) Indian Communities, whose care and keeping of these lands allows us to be here today. ASU Library acknowledges the sovereignty of these nations and seeks to foster an environment of success and possibility for Native American students and patrons. We are advocates for the incorporation of Indigenous knowledge systems and research methodologies within contemporary library practice. ASU Library welcomes members of the Akimel O’odham and Pee Posh, and all Native nations to the Library.