Loren Data's SAM Daily™

fbodaily.com
FBO DAILY ISSUE OF MAY 14, 2004 FBO #0900
SOURCES SOUGHT

D -- Web Harvesting

Notice Date
5/12/2004
 
Notice Type
Sources Sought
 
NAICS
334111 — Electronic Computer Manufacturing
 
Contracting Office
Government Printing Office, Paper and Specialized Procurement and Sales Division, Contract Management Branch, 732 North Capitol Street, NW, Washington, DC, 20401
 
ZIP Code
20401
 
Solicitation Number
5119304
 
Response Due
6/2/2004
 
Archive Date
6/17/2004
 
Point of Contact
Sheila Williams, Contract Specialist, Phone 202-512-2010 ext. 31503, Fax 202-512-1354
 
E-Mail Address
swilliams2@gpo.gov
 
Description
The U.S. Government Printing Office (GPO) plans to procure services from a vendor that can provide products and/or services related to the harvesting of documents and publications from Web sites using Web crawler and data mining technologies. GPO is involved in a project that is attempting to discover and retrieve publications from Federal agency Web sites in order to identify publications that have not been cataloged by GPO but fall within the scope of the Federal Depository Library Program (FDLP). GSA Schedule holders, please provide your GSA number.

Background on the FDLP

The FDLP was established by Congress to ensure that the American public has access to its Government's information. Since 1813, depository libraries have safeguarded the public's right to know by collecting, organizing, maintaining, preserving, and assisting users with information from the Federal Government. The FDLP provides Government information at no cost to nearly 1,250 depository libraries throughout the country and territories. These depository libraries, in turn, provide local, no-fee access to Government information in an impartial environment with professional assistance.

GPO manages the Cataloging and Indexing Program and is responsible for maintaining the Catalog of United States Government Publications (CGP). The CGP comprises bibliographic records of U.S. Government information products published by all three branches of the U.S. Government that are included in the FDLP. Bibliographic records are added daily to the CGP, with approximately 22,000 records added annually. The CGP links users directly from bibliographic citations to electronic publications by using PURLs (Persistent Uniform Resource Locators), and helps the public locate information in depository libraries and through the GPO Sales Program. GPO bibliographic data is also available to individual libraries directly from GPO and from a variety of commercial sources.
This data can be used to populate local databases and public access catalogs with bibliographic citations for U.S. Government publications. GPO prepares Machine-Readable Cataloging (MARC) records for the Online Computer Library Center (OCLC) bibliographic network. The Cataloging Branch at GPO is the national authority for cataloging and bibliographic control of U.S. Government information products and is an active partner in all components of the National Program for Cooperative Cataloging. In addition, GPO prepares and adheres to the GPO Cataloging Guidelines, which provide specific guidance for cataloging complex and dynamic U.S. Government publications and are an essential resource for the Cataloging and Indexing Program.

GPO's Web Harvesting Project

Over the past few years, GPO has become increasingly aware that many publications issued by Federal agencies are not being included in the FDLP; these documents have come to be known as "fugitive publications." With increasing frequency, agencies publish information only in electronic formats and, when this occurs, they frequently fail to inform GPO of the new publications for inclusion in the FDLP and CGP. In addition, agencies sometimes procure their printing directly from private sector companies or use in-house facilities rather than coming to GPO, and then fail to inform GPO of these publications, although there may be electronic counterparts on the publishing agency's Web site that could and should be included in the FDLP and CGP.

In light of the large number of publications that have become fugitive, GPO is seeking Web crawler and data mining technologies that can identify and harvest fugitive documents and publications from agency Web sites. To begin, GPO plans to launch a pilot project with the Environmental Protection Agency (EPA) to crawl the primary EPA Web site and its sub-agency Web sites.
The key capabilities GPO is seeking in relation to this project are as follows. The vendor must:

1. Provide Web crawling and data mining technology that will locate, identify, and capture all publications from all pages on the EPA Web site and its sub-agency Web sites that fall within the scope of the FDLP. These technologies must be able to:
   a. Identify publications in all possible formats, such as HTML, PDF, MS Word and Excel files, etc.
   b. Crawl the content of each publication, as well as external and internal metadata tied to each file.

2. Based on criteria being developed by GPO for the characteristics that constitute a publication, build a set of rules and instructions for the crawler technology to capture only those documents that meet these criteria. This must include:
   a. The capability to refine and revise rules and instructions over time as GPO gets further along the learning curve.
   b. Automated elimination of those publications retrieved by the crawler that do not fall within the scope of the FDLP based on GPO's set of criteria.

3. Provide automated comparison/collections analysis. Publications retrieved by the Web crawler and data mining technology will be matched against one or more publication databases provided by GPO, one of which will be based on MARC records cataloged for the FDLP and CGP. The technology must retain information in a database about all items previously harvested in order to avoid duplication from one crawl to another.

Responses should be submitted to Sheila Williams via mail no later than June 2, 2004, 2:00 PM Eastern Standard Time. Please include your name, address, phone number, and fax number.
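For illustration only, the rule-based filtering in item 2 and the comparison and cross-crawl duplicate check in item 3 could be prototyped along the following lines. This is a minimal sketch in Python under stated assumptions, not GPO's or any vendor's design: the `Candidate` record, the two example rules, the `harvested` table, and the use of a normalized title to match against cataloged records are all hypothetical, since the notice says GPO's criteria were still being developed and does not specify how matching against MARC records would work.

```python
import hashlib
import os
import sqlite3
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical format list, loosely based on the formats named in item 1.
FORMAT_EXTENSIONS = {".html", ".htm", ".pdf", ".doc", ".xls"}


@dataclass
class Candidate:
    """One document returned by a crawl (hypothetical record layout)."""
    url: str
    title: str
    content: bytes


def has_publication_format(c: Candidate) -> bool:
    """Example rule: the URL path ends in one of the accepted extensions."""
    path = urlparse(c.url).path.lower()
    return os.path.splitext(path)[1] in FORMAT_EXTENSIONS


def has_title(c: Candidate) -> bool:
    """Example rule: the document carries a non-empty title."""
    return bool(c.title.strip())


# Rules are plain functions in a list, so they can be revised over time.
RULES = [has_publication_format, has_title]


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace; a crude stand-in for real matching."""
    return " ".join(title.lower().split())


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database that remembers every item harvested so far."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS harvested (digest TEXT PRIMARY KEY, url TEXT)"
    )
    return conn


def harvest(conn, candidates, cataloged_titles, rules=RULES):
    """Return candidates that pass all rules, are not already cataloged,
    and were not captured in this or any earlier crawl."""
    new_items = []
    for c in candidates:
        if not all(rule(c) for rule in rules):
            continue  # out of scope under the current rule set
        if normalize(c.title) in cataloged_titles:
            continue  # already present in the catalog (assumed title match)
        digest = hashlib.sha256(c.content).hexdigest()
        seen = conn.execute(
            "SELECT 1 FROM harvested WHERE digest = ?", (digest,)
        ).fetchone()
        if seen:
            continue  # identical content was harvested before
        conn.execute(
            "INSERT INTO harvested (digest, url) VALUES (?, ?)", (digest, c.url)
        )
        new_items.append(c)
    conn.commit()
    return new_items
```

Calling `harvest` twice with the same candidates returns the new items the first time and an empty list the second, because the content digests persist in the `harvested` table. A content hash only catches byte-identical files; a production system would presumably need fuzzier matching and richer MARC field comparison than this sketch attempts.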
 
Place of Performance
Address: United States Government Printing Office, 732 North Capitol Street, NW, Washington, DC
Zip Code: 20401
Country: USA
 
Record
SN00584899-W 20040514/040512212532 (fbodaily.com)
 
Source
FedBizOpps.gov Link to This Notice
(may not be valid after Archive Date)

© 1994-2024, Loren Data Corp.