Text and Data Mining
Finding data
Below is an alphabetical list of potential sources for datasets. This list is not comprehensive. Most of the resources in this list are library-licensed content available only to ASU researchers.
We have also provided links to several search tools you can use to locate additional data sources.
- Google Dataset Search (beta): Google's Dataset Search can be used to find datasets hosted in thousands of repositories using a simple keyword search.
- DATA.gov: The central hub for open government data. Its services include data visualization, mapping tools, context to help locate and understand the data, and robust Application Programming Interface (API) access for developers.
American Physical Society
The American Physical Society makes data sets available for research upon request and under specific restrictions. Researchers must accept terms and conditions governing the use of the data sets. For more information, including terms and conditions and instructions for requesting data sets, see APS Data Sets for Research.
Data sets contain only table-of-contents-level metadata and article/citing-article pairs.
Annual Reviews
Text and Data Mining is permitted for authorized users for legitimate academic research and other educational purposes. A request must be made to Annual Reviews to gain access.
To start the request process, please contact ASU Library Researcher Support.
Cambridge
Text and data mining is permitted by authorized users for non-commercial research purposes only. Specific requirements and restrictions can be found in the Cambridge Terms of Use.
Early English Books Online (EEBO)
The Early English Books Online Text Creation Partnership (EEBO TCP) was a project to create textual transcriptions of the EEBO digitized images. It produced thousands of accurate, searchable, full-text transcriptions of early print books that are now available to everyone.
The raw files are available to download from Dropbox folders:
For more information about these collections, see The Text Creation Partnership.
HathiTrust
The HathiTrust Research Center (HTRC) facilitates non-profit research and educational uses of the HathiTrust Digital Library corpus. The focus is on non-consumptive research that remains within the bounds of fair use under U.S. copyright law. HTRC, jointly developed by Indiana University and the University of Illinois, offers a suite of tools and services for TDM. These include web-based algorithms, freely accessible datasets, and secure computing capsules.
Many of the tools and services require researchers to create an account on HTRC Analytics. These accounts are limited to those with research and teaching uses from not-for-profit and educational institutions.
- HathiTrust Digital Library: Find digitized books, journals, and government documents with full-text search and access to public domain materials from academic and research institutions.
Access Note: Full-text viewing and downloading are available only for materials in the public domain or with specific permissions. Copyrighted items may be searchable but not viewable. ASU users should log in via the Institutional Login to access and download eligible materials.
IOP
ASU researchers are permitted to conduct TDM using IOP Publishing journals that ASU Library currently subscribes to. TDM using subscribed journals is permitted for non-commercial purposes only. TDM using content published under a Creative Commons license is governed by the terms of the license for the particular article.
Researchers who wish to conduct TDM should first read through the IOP Text and Data Mining (T&DM) Policy. Next, they must email the following information to textmining@iop.org:
- their full name;
- the name of the institution to which they are affiliated;
- the titles of the journals that they wish to mine and, where relevant, the years/issues;
- the approximate length of time that they need to have access to the IOPscience website to carry out the T&DM; and
- the purpose for which they want to conduct T&DM.
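The request described above can be drafted programmatically. The following is a minimal sketch, assuming Python's standard `email` library; the function name `build_iop_tdm_request`, the subject line, and the sample values are hypothetical and are not prescribed by IOP. It only composes the message; sending it is left to your own mail client.

```python
from email.message import EmailMessage


def build_iop_tdm_request(full_name, institution, journals, access_period,
                          purpose, sender):
    """Compose (but do not send) an email with the details IOP asks for."""
    msg = EmailMessage()
    msg["To"] = "textmining@iop.org"
    msg["From"] = sender
    # The subject line is an assumption; IOP's policy does not specify one.
    msg["Subject"] = "Text and data mining request"
    lines = [
        f"Full name: {full_name}",
        f"Institution: {institution}",
        "Journals (with years/issues where relevant):",
        *[f"  - {journal}" for journal in journals],
        f"Approximate access period needed: {access_period}",
        f"Purpose of the T&DM: {purpose}",
    ]
    msg.set_content("\n".join(lines))
    return msg


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    draft = build_iop_tdm_request(
        full_name="Jane Doe",
        institution="Arizona State University",
        journals=["Example Journal A (2015-2020)"],
        access_period="three months",
        purpose="non-commercial research on terminology trends",
        sender="jane.doe@example.edu",
    )
    print(draft)
```

Printing the draft lets you review the wording before copying it into an email.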
New York Philharmonic Open Data
The New York Philharmonic has made the following Open Data collections available for research. For more information and to access datasets, please see Open Data at the New York Philharmonic.
Performance History
The Performance History database documents all known concerts of the New York Philharmonic, the New York Symphony, and the New/National Symphony, beginning with December 7, 1842, and accounts for more than 20,000 performances. The metadata has been released under the Creative Commons Public Domain CC0 license and can be located on the New York Philharmonic's GitHub page.
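As a starting point for working with this metadata, the sketch below counts how often each composer appears across programs. It assumes the data has been downloaded as JSON with a top-level `programs` list whose entries hold a `works` list with a `composerName` field; that layout is an assumption to verify against the files on the GitHub page. The records in `SAMPLE` are made-up placeholders, not real Philharmonic data.

```python
import json
from collections import Counter

# Placeholder records that mimic the ASSUMED layout of the JSON export.
SAMPLE = """
{"programs": [
  {"season": "0000-01",
   "works": [{"composerName": "Composer A", "workTitle": "Work 1"},
             {"composerName": "Composer B", "workTitle": "Work 2"},
             {"workTitle": "Intermission"}]},
  {"season": "0001-02",
   "works": [{"composerName": "Composer A", "workTitle": "Work 3"}]}
]}
"""


def works_per_composer(data):
    """Count programmed works per composer, skipping entries without one."""
    counts = Counter()
    for program in data.get("programs", []):
        for work in program.get("works", []):
            composer = work.get("composerName")
            if composer:
                counts[composer] += 1
    return counts


if __name__ == "__main__":
    counts = works_per_composer(json.loads(SAMPLE))
    for composer, n in counts.most_common():
        print(composer, n)
```

To run this on the real data, replace `json.loads(SAMPLE)` with `json.load()` on the downloaded file, adjusting field names if they differ.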
Subscriber's Project and Subscribers Database
The Subscriber's Project database contains more than 500,000 subscriber records, with names, addresses, and seat locations for Philharmonic subscribers dating back to the 19th century. (To protect privacy, post-1953 subscriber names are not searchable.)
This data is available in two ways: as a set of downloadable CSV files or in a searchable subset of subscribers between 1883 and 1907.
ProQuest
ProQuest allows data mining from historical and primary source content that the ASU Library has acquired with perpetual access.
For data mining access, data sets must also be purchased from ProQuest. Due to a number of factors, the current turnaround time for data set production is approximately six months. The cost for data sets varies, but ranges from approximately $250-$500 per set. An invoice will not be generated until the data set has been prepared and is ready to deliver. Data sets can be delivered in one of three ways:
- AWS (Amazon Web Services)
- Digital File Transfer (from one FTP site to the institution's dedicated FTP site)
- Hard disk drive (physical hard drive delivery)
To learn what data is available and start the request process, contact ASU Library Researcher Support.
SAGE Journals
Text and data mining is permitted by authorized users as long as the use is non-commercial and users only download articles to which they have legitimate access. Users are required to agree to the terms of Sage's Standard Text and Data Mining License.
Sage places limits on the number of articles that can be downloaded within a specific time frame and recommends the use of an API. More information is available at Text and Data Mining on Sage Journals.
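One way to stay within a download limit of this kind is to pace requests on the client side. The sketch below is a generic sliding-window limiter, not anything provided by Sage; the class name `SlidingWindowLimiter` is hypothetical, and the numbers in the usage example are placeholders, so substitute the limits stated in the publisher's current policy.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` calls in any `window_seconds` span."""

    def __init__(self, max_requests, window_seconds,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock    # injectable so the limiter can be tested
        self._sleep = sleep
        self._stamps = deque()  # times of requests still inside the window

    def wait(self):
        """Block until another request is allowed, then record it."""
        while True:
            now = self._clock()
            # Forget requests that have aged out of the window.
            while self._stamps and now - self._stamps[0] >= self.window_seconds:
                self._stamps.popleft()
            if len(self._stamps) < self.max_requests:
                self._stamps.append(now)
                return
            # Sleep until the oldest recorded request leaves the window.
            self._sleep(self.window_seconds - (now - self._stamps[0]))


if __name__ == "__main__":
    # Placeholder limit: 5 requests per 2 seconds, for demonstration only.
    limiter = SlidingWindowLimiter(max_requests=5, window_seconds=2.0)
    for i in range(7):
        limiter.wait()
        print("request", i, "allowed")
```

Call `wait()` immediately before each download so that a long-running job slows down on its own instead of being blocked by the platform.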
ScienceDirect (Elsevier)
Researchers from subscribing institutions are permitted to text mine for non-commercial research purposes using Elsevier's application programming interface (API). To obtain a personal API key, researchers self-register through the developer portal. Full-text content is available in XML format.
Elsevier's Object Retrieval API can be used by researchers to text mine images and other objects associated with an article.
Open Access content is available to mine via the Full Text API, following the requirements of each article's user license.
For more information please see Elsevier's Text and Data Mining Policy.
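As an illustration of what an API-based request might look like, the sketch below prepares (without sending) a full-text request for one article by DOI. The endpoint path, the `X-ELS-APIKey` header, and the `text/xml` Accept value reflect our understanding of Elsevier's developer documentation and should be treated as assumptions to confirm on the developer portal; the function name, DOI, and key are placeholders.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumed base URL for article retrieval by DOI; confirm on the portal.
API_BASE = "https://api.elsevier.com/content/article/doi/"


def build_fulltext_request(doi, api_key):
    """Prepare an HTTP request for an article's full text in XML."""
    url = API_BASE + quote(doi, safe="/")
    headers = {
        "X-ELS-APIKey": api_key,  # personal key from the developer portal
        "Accept": "text/xml",     # ask for XML full text
    }
    return Request(url, headers=headers)


if __name__ == "__main__":
    # Placeholder DOI and key; nothing is sent over the network here.
    req = build_fulltext_request("10.1016/j.example.2020.01.001", "YOUR-API-KEY")
    print(req.full_url)
    # To send it from an authorized network, pass `req` to
    # urllib.request.urlopen() and read the XML from the response.
```

Building the request separately from sending it makes it easy to check the URL and headers before any content is downloaded.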
Taylor & Francis
Taylor and Francis recently updated their Terms and Conditions to permit TDM on ASU-subscribed content as well as open access content, at no additional charge, provided the TDM is non-commercial in nature. Find more information about the TDM policy at Text and Data Mining.
Before starting TDM activity, please contact Taylor and Francis at support@tandfonline.com, with your ASU affiliation and a brief overview of your planned project. This is to ensure that they can provide any needed access and support.
Wiley
To support platform stability and user security, Wiley asks that users only access content for TDM purposes using an approved API service. To receive an API key, researchers must agree to Wiley's TDM license. More information and a link to the TDM license can be found in Wiley's Text and Data Mining Policy.
World Scientific Publishing
Text and data mining is permitted; however, World Scientific Publishing provides neither a dedicated API nor downloadable datasets. Bot-prevention technology is in place, and access to the platform is automatically cut off after 200 downloads in a short time.
Before attempting to conduct TDM on the World Scientific Publishing platform, please contact ASU Library Researcher Support. We will help coordinate access on your behalf.