SAMDaily.us | PRESOL | 70 | Detecting PII in Free Text

SAMDAILY.US - ISSUE OF MARCH 12, 2023 SAM #7775
SOLICITATION NOTICE

70 -- Detecting PII in Free Text

Notice Date: 3/10/2023 7:24:08 AM
Notice Type: Solicitation
NAICS: 513210
Contracting Office: HHS CDC ATLANTA GA 30333 USA
ZIP Code: 30333
Solicitation Number: CDCHCPC2023-73222
Response Due: 3/17/2023 9:00:00 AM
Archive Date: 04/01/2023
Point of Contact: Kelly Parker, Dawn Redman
E-Mail Address: UIT4@CDC.GOV, pwj5@cdc.gov
Description: The National Center for Health Statistics (NCHS) is modernizing its data across the full spectrum of data science needs. Expanding beyond traditional statistics, NCHS is exploring new modalities for data release, including data in unformatted free text. Free text expands the scope of NCHS data releases and gives researchers, both within the Center and externally, new opportunities for advanced analytics. However, before free text can be released, privacy and disclosure safeguards must be put in place.
Identifying and remediating personally identifying information (PII) in free text is a common problem across the Center, including Vitals through death certificates, the Research and Development Survey (RANDS), and electronic health records (EHRs). Identification of PII is a prerequisite to moving data from secured locations to outside contractors, to a research data center (RDC), and ultimately to the public. For internal use, NCHS needs to perform downstream analysis on the text, including natural language processing (NLP). PII must be removed from the models to mitigate bias: machine learning or AI trained on text data should not learn to reproduce statistics or behavior based on direct identifiers.
NCHS seeks a vendor to supply software that meets the specifications outlined under Tasks to be Performed. The vendor must provide NCHS with a secure mechanism to achieve its objective of detecting and removing PII in free text. This includes, but is not limited to, large pieces of well-structured text such as a transcribed recorded interview; short snippets such as social media posts, notes from EHRs, or the death certificates mentioned above; and very short snippets such as a column of potential PII in a table of data. NCHS cannot provide sample documents because the data are considered PII. All text will be UTF-8 text; no OCR of handwriting is needed.
Successful software would include a working out-of-the-box solution to start, one that does not require extensive customization or development of a custom model. All required detection data (format, structure, etc.) will be provided on day one.
SECTION 4 - TASKS TO BE PERFORMED
The vendor is required to provide the following minimum features for the software solution:
- A Python library or Docker image that can be deployed internally, with nothing executed remotely. Data sources are dynamic.
- The library/image would be self-contained and would not call out or make external connections (e.g., no cloud solutions). The system will need unlimited usage. It MUST be on-premises due to security concerns. It MUST be containerized or able to run in NCHS's current cloud system (currently Azure).
- The library/image would be accessible through a local Application Programming Interface (API). An API endpoint is needed for feeding in text; file systems are irrelevant. The API must accept text and return a response containing the identified, redacted, or synthetically replaced data. The API must be deployed on premises, in a secure offline environment. NCHS needs an internal API to poll as needed and is not looking for a scanner.
- The library/image would provide detection of names, ages, mentions of drugs, injuries, medical conditions, organizations (including health insurance providers and hospitals), locations, generic dates, and dates of birth.
- The library/image would provide documented accuracy of > 99% for the listed named entities. Accuracy, precision, and recall are all paramount, with the emphasis on recall over precision, to avoid the risk of identifying individuals or inadvertently disclosing private information and the legal penalties that could follow. The efficacy of the system will be evaluated using an internally held-out set of examples that are in context with NCHS data and represent standard examples and challenging edge cases.
- The library/image would allow the replacement of detected PII with synthetic examples in the output. NCHS is most concerned with the replacement task for PII rather than PHI, that is, names, addresses, DOBs, organizations, etc. Synthetic replacement is not needed for PHI, although detection still is. Synthetic replacement would replace a name, DOB, organization, or other PII with a syntactically similar result. For example, "Patient John Stilgar died on 7/15/2020 at Mercy Hospital" could be replaced with "Tyrone Simmons died on 3/14/2022 at West Lake Hospital". It is most important that names and dates are identified and replaced; organizations are less critical.
- The library/image will allow the detection of named entities in languages other than English, including Spanish.
- Redacting data is of interest. The desired system would provide a JSON response with the original text, redacted text, replacement text, and the location of each PII/PHI entity in the original text string.
- This would not include machine translation in addition to extracting text and identifying entities, as that would be out of scope.
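As an illustration only, the JSON response described above (original text, redacted text, replacement text, and entity locations) might be shaped roughly as in the following Python sketch. The field names (`original_text`, `redacted_text`, `replacement_text`, `entities`), the label names, and the helper functions are assumptions made for this example and are not specified by the notice. Detection itself is not implemented; the entity spans are supplied by hand.

```python
import json


def make_span(text, value, label, replacement):
    """Locate `value` in `text` and describe it as an entity span (hypothetical helper)."""
    start = text.index(value)
    return {"label": label, "start": start, "end": start + len(value),
            "replacement": replacement}


def build_response(text, entities):
    """Assemble the response from already-detected, non-overlapping spans.

    Offsets are character positions in the original string; `end` is exclusive.
    """
    ordered = sorted(entities, key=lambda e: e["start"])
    redacted, replaced = [], []
    cursor = 0
    for ent in ordered:
        if ent["start"] < cursor:
            raise ValueError("overlapping spans are not handled in this sketch")
        untouched = text[cursor:ent["start"]]
        redacted += [untouched, "[" + ent["label"] + "]"]   # mask with the label
        replaced += [untouched, ent["replacement"]]          # synthetic stand-in
        cursor = ent["end"]
    tail = text[cursor:]
    return {
        "original_text": text,
        "redacted_text": "".join(redacted) + tail,
        "replacement_text": "".join(replaced) + tail,
        "entities": [
            {"label": e["label"], "start": e["start"], "end": e["end"],
             "text": text[e["start"]:e["end"]]}
            for e in ordered
        ],
    }


TEXT = "Patient John Stilgar died on 7/15/2020 at Mercy Hospital"
ENTITIES = [
    make_span(TEXT, "John Stilgar", "NAME", "Tyrone Simmons"),
    make_span(TEXT, "7/15/2020", "DATE", "3/14/2022"),
    make_span(TEXT, "Mercy Hospital", "ORGANIZATION", "West Lake Hospital"),
]
RESPONSE = build_response(TEXT, ENTITIES)
print(json.dumps(RESPONSE, indent=2))
```

Reporting offsets against the original string, rather than against the redacted or replaced text, keeps each entity location stable regardless of how long its replacement is.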
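The emphasis on recall over precision in the accuracy requirement can be made concrete with a small scoring sketch. The notice does not say how efficacy will be scored; exact span matching and the F-beta score with beta = 2 (a standard metric that weights recall more heavily than precision) are assumptions chosen for illustration, and the spans below are hand-written rather than NCHS data.

```python
def precision_recall(gold, predicted):
    """Exact-match scoring over (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hand-labeled spans for one sentence (illustrative offsets, end-exclusive).
GOLD = {(8, 20, "NAME"), (29, 38, "DATE"), (42, 56, "ORGANIZATION")}

# System A finds every true entity but also flags one non-entity.
OVER_REDACTING = GOLD | {(0, 7, "NAME")}
# System B flags nothing spurious but misses the organization.
UNDER_REDACTING = {(8, 20, "NAME"), (29, 38, "DATE")}

P_A, R_A = precision_recall(GOLD, OVER_REDACTING)
P_B, R_B = precision_recall(GOLD, UNDER_REDACTING)
F2_A, F2_B = f_beta(P_A, R_A), f_beta(P_B, R_B)
print("A: precision=%.3f recall=%.3f F2=%.3f" % (P_A, R_A, F2_A))
print("B: precision=%.3f recall=%.3f F2=%.3f" % (P_B, R_B, F2_B))
```

Under these assumptions the over-redacting system scores higher, because a missed entity is a potential disclosure while a spurious redaction only costs some analytic detail.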
Web Link: SAM.gov Permalink
(https://sam.gov/opp/bb182deb10824bfa8579a82d31ee1c15/view)
Place of Performance: Address: Hyattsville, MD 20782, USA; Zip Code: 20782; Country: USA
Record: SN06615192-F 20230312/230310230108 (samdaily.us)
Source: SAM.gov Link to This Notice
(may not be valid after Archive Date)


Loren Data's SAM Daily™
