Guidance: Data Security and Privacy

Federal regulations require IRBs to determine the adequacy of provisions to protect the privacy of subjects and to maintain the confidentiality of their data. To meet this requirement, federal regulations require researchers to provide a plan to protect the confidentiality of research data. Today, the majority of data are at some point collected, transmitted, or stored electronically. The purpose of this guidance is to help the research community develop best practices for managing electronic data. These best practices will need to adapt as technology evolves, so it is important that research teams keep current with the guidance and resources offered by the University. Researchers are expected to be proactive in designing and performing research to ensure that the dignity, welfare, and privacy of individual research subjects are protected and that information about an individual remains confidential. The protection of research data is a fundamental responsibility, rooted in regulatory and ethical principles, and should be upheld by all data stewards.

Important Definitions

Anonymous: Anonymous data are collected in a manner where the identity of the subject cannot be determined by anyone at any time, not even the researcher. There are no links between the data and the individual person. Anonymous data are stripped of personally identifiable information (e.g., no names or student numbers). For example, online surveys (e.g., Qualtrics) are typically conducted anonymously, provided the IP address is not stored. This includes any information that was recorded or collected without any of the 18 identifiers defined by HIPAA.

Coded: Identifying information (such as name) that would enable the investigator to readily ascertain the identity of the individual to whom the private information or specimens pertain has been replaced with a code (number, letter, symbol, or any combination) and a key to decipher the code exists, enabling linkage of the identifying information to the private information or specimens.
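As an illustration of the definition above, the following is a minimal Python sketch of coding a data set: each name is replaced with a random code, and a separate key links codes back to names. The function name code_records, the field names, the "S-" code format, and the sample records are all hypothetical and are not a prescribed University procedure.

```python
import secrets


def code_records(records, id_field="name"):
    """Replace the identifying field in each record with a random study code.

    Returns (coded_records, key), where key maps each code back to the
    original identifier. The key is what makes the data "coded" rather
    than anonymous, so it should be stored apart from the coded data.
    """
    key = {}        # code -> original identifier
    code_for = {}   # original identifier -> code (reused for repeat subjects)
    coded_records = []
    for record in records:
        identifier = record[id_field]
        if identifier not in code_for:
            code = "S-" + secrets.token_hex(4)  # random, not derived from the identity
            while code in key:  # regenerate on the unlikely chance of a collision
                code = "S-" + secrets.token_hex(4)
            code_for[identifier] = code
            key[code] = identifier
        # Copy every field except the identifier, then attach the code.
        coded = {k: v for k, v in record.items() if k != id_field}
        coded["subject_code"] = code_for[identifier]
        coded_records.append(coded)
    return coded_records, key


# Hypothetical example data, not drawn from any real study.
raw = [
    {"name": "Participant One", "score": 12},
    {"name": "Participant Two", "score": 15},
    {"name": "Participant One", "score": 14},
]
coded, key = code_records(raw)
```

The codes are generated randomly rather than derived from the name (for example, by hashing it), so the code alone reveals nothing about the person. Anyone holding both the coded records and the key can re-identify subjects, which is why the two are kept apart.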

Confidential: Confidential data are not necessarily anonymous. For example, surveys collected in a face-to-face environment are typically labeled as confidential. Subjects may agree to participate in a research study only when assured that the information they share will remain protected from disclosure outside of the research setting. The researcher agrees to collect, store, and share research data in a way that the information obtained about the research participant is protected and not improperly disclosed.

De-identified: All identifiers have been removed from the data set even though identifiers may still exist in a separate file. For example, the data set is de-identified and the master list containing names and de-identified codes is stored in a different location not easily accessible to the researcher or any other person. De-identification prevents a person’s identity from being connected with their responses.

Identifiable: This type of data includes personal identifiers and links associated with the data set. Identifiers include any information used to distinguish one person from another (e.g., personally identifiable information). These identifiers could be sensitive (e.g., medical information) or non-sensitive (e.g., public records or websites). Be careful about what identifiable information you collect from your research subjects.

PHI: Protected Health Information. "Individually identifiable health information, whether oral or recorded in any form or medium (e.g., narrative notes; X-ray films or CT/MRI scans; EEG / EKG tracings, etc.), that:

  • may include demographic information, and
  • is created or received by a ‘covered entity,’ that is, a health care provider, health plan, or health care clearinghouse, and
  • relates to the past, present, or future physical or mental health or condition of an individual, to the provision of health care to that individual, and/or to payment for health care services and
  • identifies the individual directly or contains sufficient data so that the identity of the individual can be readily inferred." (Source: HHS: Protecting Personal Health Information in Research)

PII: Personally Identifiable Information: “(1) any information that can be used to distinguish or trace an individual’s identity, such as name, social security number, date and place of birth, mother’s maiden name, or biometric records; and (2) any other information that is linked or linkable to an individual, such as medical, educational, financial, and employment information.” (Source: OMB Memorandum M-07-16: Safeguarding Against and Responding to the Breach of Personally Identifiable Information)

Private Information: Information about behavior that occurs in a context in which an individual can reasonably expect that no observation or recording is taking place, and information that has been provided for specific purposes by an individual and that the individual can reasonably expect will not be made public (for example, a medical record). Private information must be individually identifiable (i.e., the identity of the subject is or may readily be ascertained by the investigator or associated with the information) in order for obtaining the information to constitute research involving human subjects.

Sensitive Research Data: Data are considered sensitive when disclosure of identifying information could have adverse consequences for subjects or damage their financial standing, employability, insurability, or reputation.

Investigator Responsibilities

Anyone who conducts research with human subjects at Lehigh University has a responsibility to protect the data collected and used for their research. This is especially important when the data (a) contain personal identifiers or enough detailed information that the identity of participating human subjects can be inferred, (b) contain information that is highly sensitive, or (c) are covered by a restricted use agreement. The guidelines below are intended to help researchers understand when and how to use the most effective and efficient methods for storing and analyzing confidential research data so that those data are adequately protected from theft, loss or unauthorized use.

The Principal Investigator (PI) is responsible for ensuring that research data are secure when they are collected, stored, transmitted, or shared. All members of the research team should receive appropriate training about securing and safeguarding research data. For example, the research team should understand they need to document their standard practices for protecting research data so that they can provide these details to the IRB if a mobile device is lost or stolen. Data security should be discussed regularly at research team meetings. Researchers must provide sufficient information concerning data security procedures as part of the IRB human subjects application. Depending on the level of possible risk to participants, the IRB may require additional data safety procedures to be implemented prior to IRB approval.

As a general practice, researchers working with human subjects should avoid collecting personally identifiable information (PII) whenever possible. Perhaps the best way to protect a research subject’s identity is by not knowing that identity in the first place. However, in many cases, the collection of PII is necessary for carrying out a research project.

There are many ways in which PII arises in the normal course of conducting research. If subjects sign informed consent agreements, their signatures are identifying information that must be securely stored. If subjects are awarded a prize or paid for their participation in a study, the researcher needs enough identifying information to enable delivery of the payment or prize. In some cases, researchers may need to merge data from different sources (e.g., survey responses and biological data), a step that can only be carried out with some form of personal identifier. Likewise, longitudinal studies usually require storage of detailed personal identifiers so that subjects can be contacted for subsequent interviews over long periods of time.

Assessing Necessary Data Security Methods

Based on the type of data involved in the study, the IRB is required to 1) assess potential risks to participants, and 2) evaluate the researchers’ plan to minimize risks.  All research activities result in some type of risk and the researcher has the responsibility to mitigate the risk of improper disclosure.
What is the risk?
  • Is the data identifiable, de-identified (coded), or anonymous?
  • Is sensitive information being collected that could result in harm to participants?
  • What is the risk of harm to the participant or others?
What are the protections against anticipated threats or hazards (during collection, transmission, storage)?
  • Encryption of data on device to protect against loss/theft of device
  • Use of secure data transmission channels to protect against data interception
  • Strong passwords to protect against unauthorized access
  • Store data behind a secure firewall whenever possible
  • Ensure strong data security controls on all storage sites
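As one illustration of the "strong passwords" item above, the following Python sketch generates a random password using the standard library's secrets module. The function name generate_password, the default length of 20, and the minimum of 12 characters are assumptions made for this example and do not represent a University password policy.

```python
import secrets
import string


def generate_password(length=20):
    """Return a random password with a lowercase letter, uppercase letter, digit and symbol."""
    if length < 12:
        # Assumed floor for this sketch only; follow your institution's policy.
        raise ValueError("length below 12 is treated as too short in this sketch")
    alphabet = string.ascii_letters + string.digits + string.punctuation
    while True:
        # secrets.choice draws from a cryptographically secure random source.
        password = "".join(secrets.choice(alphabet) for _ in range(length))
        if (any(c.islower() for c in password)
                and any(c.isupper() for c in password)
                and any(c.isdigit() for c in password)
                and any(c in string.punctuation for c in password)):
            return password
```

A password manager serves the same purpose in practice; the point of the sketch is that the password comes from a secure random source and is not reused across accounts.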


Encryption protects data by encoding information so that only authorized parties may read it. Encryption can occur “at-rest” where the data are being stored and “in-transit” as the data are being moved from one location to another. There are many tools and methods available to encrypt all types of data; for more information, please visit the Lehigh LTS site on encryption & data security.
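For the in-transit case, the following minimal Python sketch shows two safeguards a script might apply before sending research data: refusing any destination that is not HTTPS, and building a TLS context that verifies the server's certificate and hostname. The function names require_https and verified_tls_context and the TLS 1.2 floor are assumptions for illustration; encryption at rest is not covered here and is typically handled by full-disk or file-level encryption tools.

```python
import ssl
from urllib.parse import urlparse


def require_https(url):
    """Return the URL unchanged if it uses HTTPS; otherwise raise ValueError."""
    if urlparse(url).scheme != "https":
        raise ValueError("refusing to send research data over a non-HTTPS URL")
    return url


def verified_tls_context():
    """Build a TLS context that verifies the server certificate and hostname."""
    # create_default_context() enables certificate and hostname checks by default.
    context = ssl.create_default_context()
    # Refuse legacy protocol versions; TLS 1.2 is an assumed floor for this sketch.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

The returned context can be passed to standard-library clients, for example urllib.request.urlopen(url, context=verified_tls_context()).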

What is Personally Identifiable Information (PII)?

PII is defined as information that is uniquely associated with an individual person. The HIPAA privacy rules identify 18 items (such as name, mailing address, email address, social security number, etc.) that are considered to be forms of PII. While the list is regarded as comprehensive, it is not necessarily exhaustive.

Inferring the Identity of Research Subjects

It is sometimes possible to infer the identity of someone participating in a research study even when the data for the study do not contain any explicit identifiers such as those listed above. For example, by cross-referencing certain variables such as state of residence, occupation, education, age, sex, and race, it might be possible to infer the identity of a research subject. As such, the absence of personal identifiers from a research data set does not obviate the need for secure storage and protection. Similarly, when research data sets are being made available for public use, the data need to be stripped of all personal identifiers and coded in a manner that does not allow anyone to infer the identity of a subject. This is often a difficult task because the identity of individuals can be inferred by using data sets from multiple sources.  The proliferation of public use datasets and publicly available records has increased the odds of being able to infer someone’s identity by merging multiple data sources through a phenomenon known as the Mosaic effect. Researchers who produce or share anonymous public use data files need to consider whether the data they are using or releasing could be used in combination with other publicly available data to infer individual identities. Researchers are encouraged to consult the Research Integrity office to determine if their proposed research involves human subjects. 
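One common way to screen a data set for this kind of inference risk is to count how many records share each combination of indirect identifiers, an approach usually called a k-anonymity check. That term and the sketch below are not part of this guidance; the function name small_groups, the threshold k, and the sample records are hypothetical.

```python
from collections import Counter


def small_groups(records, quasi_identifiers, k=5):
    """Return combinations of quasi-identifier values shared by fewer than k records."""
    counts = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical records containing no direct identifiers.
records = [
    {"state": "PA", "occupation": "teacher", "age": 34},
    {"state": "PA", "occupation": "teacher", "age": 34},
    {"state": "PA", "occupation": "surgeon", "age": 61},
]
risky = small_groups(records, ["state", "occupation", "age"], k=2)
# risky == {("PA", "surgeon", 61): 1}
```

Any combination returned here describes so few people that a record might be traced to an individual. A check like this examines only the data set itself, so passing it does not rule out re-identification through linkage with outside data sources.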

Research Involving Audio or Video Recordings

Voice and/or video recordings of research participants are considered identifiable data, even if names or other individually identifiable information is not specifically solicited as part of the research protocol. Any recording of participants for research purposes must be disclosed as part of the informed consent process, with participants providing their consent to the recording, storage, and use of this information. Recordings must be saved to a secure, encrypted location, with access restricted to authorized research personnel through the use of a strong and unique password. Recordings should be kept only as long as necessary to achieve research goals, and deleted as soon as possible. For more information on the use of audio or video in research, please see our guidance on Audio and Video Conferencing in Human Subjects Research.

Highly Sensitive Data

Research data are considered highly sensitive when there is a heightened risk that disclosure may result in embarrassment or harm to the research subject. Data on topics such as sexual behavior, illegal drug use, criminal behavior, crime victimization or mental health are considered highly sensitive. Information that could have adverse consequences for subjects or damage their financial standing, employability, insurability, or reputation should be adequately protected from public disclosure, theft, loss or unauthorized use, especially if it includes PII.

Restricted Use Agreements

Many researchers at Lehigh University receive data from outside agencies or institutions that are subject to restricted use agreements (also called data sharing agreements). These are legal contracts that impose restrictions on the researchers’ use of the data and sometimes include detailed procedures for secure storage, restricted access and analysis of the data. As part of the agreement, certain government agencies may also visit the researcher (or “licensee”) to conduct a compliance audit. In other cases, restricted use agreements may simply prevent public release of the data or sale of the data to a third party.  But in cases where an agreement does not specify data security procedures, researchers must consider the need to keep their data secure so that the potential for harm to any individuals or organizations is minimized. When faced with two sets of data security requirements (e.g., one from the Lehigh University IRB and one from a data sharing agreement), the researcher should always default to the requirements with higher standards for data protection. 

PII Data from Open Public Records

Research that uses open public records containing PII (e.g., voter registration files, telephone directories, occupational license registries, property tax records, firearms registries, criminal records) may not meet the regulatory definition of research involving human subjects. However, researchers are advised to use caution when dealing with public records data that contain sensitive information. Merging and publishing sensitive information from publicly available records has the potential to embarrass or harm individuals described in the records even though the information is already public. Researchers are encouraged to consult the Research Integrity office to determine if their proposed research involves human subjects and whether risk of harm has been adequately minimized.

Public vs. Private Internet Data

The Common Rule defines "private information" as "information about behavior that occurs in a context in which an individual can reasonably expect that no observation or recording is taking place, and information which has been provided for specific purposes by an individual and which the individual can reasonably expect will not be made public (for example, a medical record)" (Source: 45 CFR 46.102(e)(4)).
For data obtained from internet sources, such as social media or online discussion boards, the distinction between "private" vs. "public" information is often ambiguous. In order to determine whether data obtained from online sources can be considered public or private information, a number of factors should be evaluated:
  • Has the information been posted / shared by an individual freely without restrictions on its access or use? Or was the information originally shared under a reasonable assumption of privacy / confidentiality?
  • Do the Terms of Service for the source website expressly permit (or prohibit) the use of website data for research purposes?
  • Does the act of collecting or compiling this information for research purposes itself represent a harm or risk to the individual or the original source of the information?
  • Is there a clear "norm" or expectation of privacy for the source of information being collected? For example, members of online communities may share personal information about themselves with the expectation that this information is intended only for the other members of the group. Such information may be "publicly" available but should be considered private information for the purposes of research.
  • If the information were to be shared widely outside of its original context, would this introduce new privacy concerns for the individuals who originally shared this information?
Researchers are encouraged to consult the Research Integrity office for assistance in assessing the public vs. private nature of their proposed online data. 
For more information on the ethical review and conduct of internet research, please see the HHS guidance “Considerations and Recommendations Concerning Internet Research and Human Subjects Research Regulations.”

Public Use Data Files

Public use data files are files from which all PII has been removed and in which the data are coded in such a way as to make identification of research subjects extremely unlikely. Research that uses public use data sets that do not contain PII may not meet the regulatory definition of research involving human subjects. However, some restricted use agreements nevertheless require local IRB review. As such, researchers are encouraged to consult the Research Integrity office to determine if their proposed research requires IRB review.

"Anonymous" or "Confidential"

Privacy is about people. Confidentiality is about data.

Per HHS and FDA regulations (45 CFR 46.111(a)(7) and 21 CFR 56.111(a)(7)), the IRB shall determine that, where appropriate, there are adequate provisions to protect the privacy of subjects and to maintain the confidentiality of data in order to approve human subjects research. The committee must consider the sensitivity of the information collected and the protections offered to the subjects. Privacy and confidentiality are also supported by two principles of the Belmont Report:

  • Respect for persons – Individuals should be treated as autonomous agents able to exercise their autonomy to the fullest extent possible, including the right to privacy and the right to have private information remain confidential.
  • Beneficence – Maintaining privacy and confidentiality helps to protect participants from potential harms including psychological harm such as embarrassment or distress; social harms such as loss of employment or damage to one’s financial standing; and criminal or civil liability.

Especially in social/behavioral research, the primary risk to subjects is often an invasion of privacy or a breach of confidentiality.

Privacy is the control over the extent, timing, and circumstances of sharing oneself (physically, behaviorally, or intellectually) with others. For example, persons may not want to be seen entering a place that might stigmatize them, such as a pregnancy counseling center clearly identified by signs on the front of the building. The evaluation of privacy also involves consideration of how the researcher accesses information from or about potential participants (e.g., recruitment process). IRB members consider strategies to protect privacy interests relating to contact with potential participants, and access to private information.

Privacy is:

  • About people
  • A sense of being in control of access that others have to ourselves
  • A right to be protected
  • In the eye of the participant, not the researcher or the IRB

Confidentiality pertains to the treatment of information that an individual has disclosed in a relationship of trust and with the expectation that it will not be divulged to others without permission in ways that are inconsistent with the understanding of the original disclosure.
During the informed consent process, if applicable, subjects must be informed of the precautions that will be taken to protect the confidentiality of the data and be informed of the parties who will or may have access (e.g., research team, FDA, OHRP). This will allow subjects to decide about the adequacy of the protections and the acceptability of the possible release of private information to the interested parties.


Confidentiality:

  • Is an extension of privacy
  • Is an agreement about maintenance and who has access to identifiable data
  • With regard to HIPAA, protects patients from inappropriate disclosures of Protected Health Information (PHI)

This guidance was adapted with permission from the Princeton University "Research Data Security" and the University of Pittsburgh "Electronic Data Security" webpages.