Remove flagged attributes from capture base #31

Open
mitfik opened this issue Jan 10, 2024 · 16 comments
Labels
documentation Improvements or additions to documentation enhancement New feature or request v2

Comments

@mitfik
Contributor

mitfik commented Jan 10, 2024

Current spec:

Any attributes defined in a Capture Base that may contain identifying information about entities (i.e., personally identifiable information (PII) or quasi-identifiable information (QII)) can be flagged.

The problem:

  • attaching flagged attributes to the capture base reduces the reusability of that object
  • flagged attributes are contextual: whether something is treated as identifiable information depends on the use case and jurisdiction. A postal code, for example, may be identifying in the UK and not sensitive at all in Hong Kong
  • building a complex object can "convert" an attribute from non-sensitive to highly sensitive when it is used together with another one (e.g. a birth date on its own does not reveal identifying information, but combined with an address or a name it can identify a specific person)
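
A minimal sketch of the last two points, assuming purely illustrative policy tables (the jurisdiction codes, attribute names, and rules below are hypothetical and are not legal guidance). It shows why a single fixed flag per attribute cannot express sensitivity that depends on jurisdiction or on which attributes appear together:

```python
# Hypothetical policy tables for illustration only.
SENSITIVE_ALONE = {
    "UK": {"name", "postal_code"},
    "HK": {"name"},
}

# Attributes assumed to become identifying only when they appear together.
SENSITIVE_COMBINATIONS = [
    {"birth_date", "address"},
    {"birth_date", "name"},
]


def sensitive_attributes(attributes, jurisdiction):
    """Return the subset of `attributes` treated as sensitive in this context."""
    attrs = set(attributes)
    flagged = attrs & SENSITIVE_ALONE[jurisdiction]
    for combo in SENSITIVE_COMBINATIONS:
        if combo <= attrs:  # every member of the combination is present
            flagged |= combo
    return flagged


uk_flags = sensitive_attributes(["postal_code", "height"], "UK")
hk_flags = sensitive_attributes(["postal_code", "height"], "HK")
alone = sensitive_attributes(["birth_date", "height"], "HK")
combined = sensitive_attributes(["birth_date", "address"], "HK")
```

Here the same `postal_code` attribute is flagged under the assumed UK policy but not under the HK one, and `birth_date` is flagged only once `address` is part of the same object.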

Proposition:
Move the whole sensitivity topic to the Sensitive Overlay, which already exists and serves that specific purpose; we could develop it further towards a sensitivity matrix and more. This would take load off the capture base and increase the reusability of the object across use cases. It would also decouple responsibility for creating the object from responsibility for flagging it. Multiple entities could take care of these tasks, which in practice is often the case (the object is created during development, then goes to legal, who review it and, depending on the conditions, may apply restrictions on how specific attributes are treated).
Later on, the Sensitive Overlay could be extended into a general Flagging Overlay that could flag other kinds of characteristics (e.g. confidentiality).
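
A rough sketch of what the proposed migration could look like for one object. The `flagged_attributes` key and the `spec/capture_base/1.0` type string reflect my reading of the current capture base format and should be treated as assumptions; `split_flags` and the attribute names are hypothetical. Removing the flags changes the capture base content, so its identifier would have to be recomputed; a placeholder string stands in for it here.

```python
import copy


def split_flags(capture_base, capture_base_said):
    """Return (capture base without flags, Sensitive Overlay carrying them).

    `capture_base_said` is the identifier of the stripped capture base; this
    sketch does not compute it.
    """
    stripped = copy.deepcopy(capture_base)
    flagged = stripped.pop("flagged_attributes", [])
    overlay = {
        "capture_base": capture_base_said,
        "type": "spec/overlays/sensitive/1.0",
        "attributes": list(flagged),
    }
    return stripped, overlay


# Hypothetical capture base used only for illustration.
capture_base = {
    "type": "spec/capture_base/1.0",
    "attributes": {"name": "Text", "birth_date": "DateTime", "height": "Numeric"},
    "flagged_attributes": ["name", "birth_date"],
}

stripped, overlay = split_flags(capture_base, "<SAID of the stripped capture base>")
```

After the split, the same stripped capture base can be paired with different Sensitive Overlays, which is the reusability gain the proposal describes.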

@mitfik mitfik added documentation Improvements or additions to documentation enhancement New feature or request labels Jan 10, 2024
@pknowl
Collaborator

pknowl commented Jan 24, 2024

Separating attribute flagging from the capture base is not a good idea, particularly when it comes to ensuring consistency and integrity in large-scale data aggregation, such as aggregating clinical trial data from multiple sister studies for accurate statistical analysis.

  1. Consistency: Maintaining the flagging within the capture base ensures consistency across different uses of the same base. This is crucial in scenarios where the same data set is used in multiple settings.

  2. Data Integrity: With flagging as an integral part of the capture base, there's less risk of misalignment or oversight, as the sensitivity analysis is directly tied to the data.

  3. Simplification in Aggregation: When aggregating data from various sources, having a consistent flagging mechanism reduces the complexity and potential errors in harmonizing data.

  4. Risk of Misalignment: If not carefully managed, different applications of the same capture base could end up with inconsistent flagging, leading to errors in data aggregation.

@mitfik
Contributor Author

mitfik commented Jan 24, 2024

Consistency, integrity and simplification in aggregation are assured by the bundle anyway, so nothing changes here.

Risk of misalignment: this sits on the governance layer anyway. If you pull data from different jurisdictions, the flagging will not be aligned in any case because the regulations differ, and not sharing the same capture base would make this misalignment even worse. Extracting flagging into a separate layer and pushing responsibility to the governance layer better facilitates data exchange across governances without losing any of the existing properties and use cases.

@pknowl
Collaborator

pknowl commented Jan 24, 2024

Some additional considerations:

  • Regulatory Compliance: In many industries, especially in healthcare and clinical research, regulatory compliance is non-negotiable. Keeping flagging within the capture base can ensure that all data handling meets legal standards consistently.

  • Audit Trails and Accountability: By maintaining flagging as part of the capture base, it's easier to create transparent audit trails and ensure accountability, as every instance of data use will inherently include the necessary sensitivity flags.

  • Training and Standardization: Centralizing flagging processes can simplify training requirements and standardization efforts across different teams and departments, as the principles and practices remain consistent across all applications.

Clinical data is heavily regulated, and sensitive attributes require global standardization. This is especially true in clinical trials where data must adhere to stringent international regulations such as HIPAA in the United States, GDPR in Europe, and other national and international frameworks. These standards ensure uniform protection of patient information across borders, making consistency in data handling not just a best practice, but a regulatory requirement.

@blelump
Member

blelump commented Jan 25, 2024

The proposed change strengthens rather than weakens the overall design. It extracts specific metadata that describes particular behavior that is optional. Some use cases benefit from it, and some do not; therefore, we propose making it coherent by moving it into a separate object.

Technically, PII flags are metadata no different from any other existing overlay. It is up to governance to ensure PII is correctly handled and compliant with regulations. With this change we technically introduce only a minor tweak for task-oriented granularity.

  1. Regarding consistency, if there are many OCA Bundles with the same CB and distinct sensitive overlays within the same governance, something is wrong at the governance level. Recall that distributed governance brings reputation, which we rely on, especially in highly regulated environments.
  2. Regarding integrity, same as above.
  3. Data harmonization is a process at the inner- or cross-governance level. In highly regulated environments, we must adhere to the regulations, and performing data harmonization is also part of governance, specifically how to harmonize and what the process constraints are.
  4. Same as point 1.
  5. Regarding regulatory compliance, nothing has changed. Technology still supports the requirements, and if they must be adhered to in a given use case, it is part of the governance to specify it.
  6. Governance decides whether a given use case must be audited and, if so, how. Recall that the role of audit trails in the ecosystem is to provide unambiguous information about changes. How a change adheres to governance and higher-level regulations is for the auditor to verify. We can provide what has changed and what bundle was used at that time. That is all the auditor has to know.
  7. Training and Standardization: If this point is about what flows through the ecosystem, who consumes it, or what the level of quality is, it is, again, not the role of the underlying technology to make it clean. This is the role of the reputation that distributed governance brings into it.
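
Point 1 describes a condition that tooling at the governance level could check mechanically. The sketch below is one hypothetical way to do it: the function name and the sample identifiers are assumptions, and only the overlay keys (`capture_base`, `type`, `attributes`) follow the Sensitive Overlay shape from the spec.

```python
from collections import defaultdict


def conflicting_capture_bases(overlays):
    """Return capture base identifiers that have more than one distinct
    set of sensitive attributes among the given overlays."""
    seen = defaultdict(set)
    for overlay in overlays:
        if overlay.get("type", "").startswith("spec/overlays/sensitive/"):
            seen[overlay["capture_base"]].add(frozenset(overlay["attributes"]))
    return sorted(said for said, sets in seen.items() if len(sets) > 1)


# Hypothetical overlays collected within one governance.
overlays = [
    {"capture_base": "CB-1", "type": "spec/overlays/sensitive/1.0", "attributes": ["sex"]},
    {"capture_base": "CB-1", "type": "spec/overlays/sensitive/1.0", "attributes": ["sex", "age"]},
    {"capture_base": "CB-2", "type": "spec/overlays/sensitive/1.0", "attributes": ["name"]},
    {"capture_base": "CB-2", "type": "spec/overlays/sensitive/1.0", "attributes": ["name"]},
]

conflicts = conflicting_capture_bases(overlays)
```

In this sample only `CB-1` is reported, because its two overlays disagree; whether such a disagreement is an error or an intended per-context difference is a governance decision the sketch does not make.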

@pknowl
Collaborator

pknowl commented Jan 27, 2024

This points to Data Governance versus Data Quality [link]. While the above argument focuses on separating the flagging into a different layer for flexibility and governance reasons, my concerns about maintaining consistency and integrity in a highly regulated environment like clinical trials are still valid. In such environments, a combination of standardized documentation per domain (e.g., Vital Signs, Adverse Events, Demography, etc.) and more tailored, study-specific documentation can balance consistency and flexibility. A hybrid approach would ensure adherence to core CDISC standards while allowing for necessary adjustments at the study level. This hybrid approach has always been in my awareness since Day 1 of the OCA invention. It still holds firm today.

@carlyh-micb
Collaborator

Going through the OCA spec, there is already an additional "sensitive" overlay.

Sensitive Overlay

A Sensitive Overlay defines attributes not necessarily flagged in the Capture Base that need protecting against unwarranted disclosure. For example, data that requires protection for legal or ethical reasons, personal privacy, or proprietary considerations.

In addition to the capture_base and type attributes (see Common attributes), the Sensitive Overlay MUST include the following attribute:

attributes

The attributes attribute is an array of attributes considered sensitive.

{
  "capture_base": "EVyoqPYxoPiZOneM84MN-7D0oOR03vCr5gg1hf3pxnis",
  "type": "spec/overlays/sensitive/1.0",
  "attributes": [
    "sex"
  ]
}
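
As a hedged illustration of the rule quoted above, the following sketch checks that an overlay carries the three keys the spec text names. The additional check that every listed attribute exists in the capture base is my own assumption, not something the quoted text requires, and `validate_sensitive_overlay` and the attribute set are hypothetical.

```python
import json


def validate_sensitive_overlay(overlay, capture_base_attributes):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    for key in ("capture_base", "type", "attributes"):
        if key not in overlay:
            problems.append(f"missing required key: {key}")
    attrs = overlay.get("attributes", [])
    if not isinstance(attrs, list):
        problems.append("'attributes' must be an array")
        attrs = []
    for name in attrs:
        # Assumption: sensitive attributes should be defined in the capture base.
        if name not in capture_base_attributes:
            problems.append(f"unknown attribute: {name}")
    return problems


example = json.loads("""
{
  "capture_base": "EVyoqPYxoPiZOneM84MN-7D0oOR03vCr5gg1hf3pxnis",
  "type": "spec/overlays/sensitive/1.0",
  "attributes": ["sex"]
}
""")

# Hypothetical attribute names of the referenced capture base.
known_attributes = {"sex", "age"}

ok = validate_sensitive_overlay(example, known_attributes)
bad = validate_sensitive_overlay(
    {"capture_base": example["capture_base"], "attributes": ["weight"]},
    known_attributes,
)
```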

@blelump
Member

blelump commented Feb 1, 2024

> This points to Data Governance versus Data Quality [link]. While the above argument focuses on separating the flagging into a different layer for flexibility and governance reasons, my concerns about maintaining consistency and integrity in a highly regulated environment like clinical trials are still valid. In such environments, a combination of standardized documentation per domain (e.g., Vital Signs, Adverse Events, Demography, etc.) and more tailored, study-specific documentation can balance consistency and flexibility. A hybrid approach would ensure adherence to core CDISC standards while allowing for necessary adjustments at the study level. This hybrid approach has always been in my awareness since Day 1 of the OCA invention. It still holds firm today.

Framing this discussion as Data Governance versus Data Quality, and specifically claiming there is a trade-off between them, is already a misunderstanding. While (distributed) governance imposes regulations within an ecosystem and across ecosystems, data quality is about how granular the captured metadata that describes (enriches) the data is. These two characteristics are therefore not comparable; they are entirely different topics.

Our reasoning in the proposal is about moving one piece, PII flagging, to where it naturally fits. Technically, it is just another kind of metadata that serves a specific purpose. Therefore, nothing changes in how the underlying technology is perceived or described.

Since this is a technical change, please elaborate on which concerns regarding consistency and integrity, their maintenance, etc., you have in mind. This will help to keep this discussion clear.

@pknowl
Collaborator

pknowl commented Feb 1, 2024

While I understand and appreciate the distinction you've drawn between Data Governance versus Data Quality, I believe our discussion intersects both in the context of ensuring the integrity and utility of our data systems, particularly under the umbrella of clinical data standards.

The crux of my concern revolves around the proposal to use Sensitive Overlays exclusively for flagging PII, removing this function from the Capture Base. While technically feasible and offering a degree of flexibility, this approach introduces potential risks to the consistency and integrity of our data handling practices, especially in environments governed by stringent regulations like clinical trials.

### Scenario Illustration

Consider two scenarios:

Scenario 1 (Without Flagging Block in Capture Base):
We have a Capture Base without a flagging block for PII.
Separate entities apply two different Sensitive Overlays, each identifying different attributes as sensitive.
This results in inconsistent protection measures for the same dataset when used across different contexts or studies, potentially leading to compliance issues and data mismanagement.

Scenario 2 (With Flagging Block in Capture Base):
The Capture Base includes a flagging block explicitly identifying core PII attributes.
Sensitive Overlays can augment this with additional protections but cannot alter the baseline flagging established in the Capture Base.
This ensures consistent handling of identified PII across all uses, providing a stable foundation for compliance and data integrity.
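
A minimal sketch of the two scenarios, with hypothetical attribute names and overlays. It assumes that two entities really do publish divergent Sensitive Overlays for the same Capture Base, and it models Scenario 2 as a simple set union in which overlays can only add to the baseline.

```python
def effective_flags(baseline, overlay_attributes):
    """Flags in force for one use of the Capture Base.

    Overlays may add flags but never remove anything from the baseline.
    """
    return set(baseline) | set(overlay_attributes)


BASELINE = {"name", "birth_date"}   # hypothetical flagging block in the Capture Base
OVERLAY_STUDY_A = {"postal_code"}   # hypothetical overlay applied by entity A
OVERLAY_STUDY_B = {"sex"}           # hypothetical overlay applied by entity B


def commonly_protected(baseline):
    """Attributes protected in both studies, given a baseline."""
    flags_a = effective_flags(baseline, OVERLAY_STUDY_A)
    flags_b = effective_flags(baseline, OVERLAY_STUDY_B)
    return flags_a & flags_b


scenario_1 = commonly_protected(set())     # no flagging block in the Capture Base
scenario_2 = commonly_protected(BASELINE)  # flagging block present
```

Under these assumptions, `scenario_1` is empty (no attribute is protected in both studies), while `scenario_2` always contains the baseline attributes.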

Why Maintain a Flagging Block in the Capture Base?

  • Consistency Across Use Cases: A unified approach to identifying PII ensures that, irrespective of the application or jurisdiction, there's a baseline protection that aligns with global data protection standards and clinical trial regulations.

  • Foundation for Regulatory Compliance: Given the variability in regulations across jurisdictions, a flagging block in the Capture Base acts as a common denominator, ensuring that essential PII is always treated with the highest privacy and security standards.

  • Data Quality and Governance Synergy: While governance sets the framework for data handling and privacy standards, the quality of data management practices is inherently tied to how consistently these standards are applied. Integrating PII flagging within the Capture Base enhances data governance and quality by ensuring that sensitive attributes are uniformly recognized and protected.

  • Technical and Operational Efficiency: A stable, consistent method for flagging PII reduces the complexity and potential for error in managing sensitivity across datasets, particularly in large-scale, multi-study environments like ours.

The proposal to maintain a flagging block within the Capture Base is not merely a technical preference but a strategic decision to uphold the highest data integrity, privacy, and compliance standards. It provides a necessary foundation upon which flexible, context-specific handling of additional sensitivities can be built through Sensitive Overlays.

This approach does not diminish the importance or utility of distributed governance or the granularity of data quality. Instead, it ensures that these two critical aspects of our data management strategy work harmoniously to support the safe, effective, and compliant use of clinical data, for example.

If you need some code and dataset examples to delve deeper, drop a comment, and I'll put together a dummy Capture Base, a couple of Sensitive Overlays, some data packets, and some corresponding datasets. This will illustrate how flagging misalignment, when aggregating datasets from various sources, can rapidly lead to significant data protection challenges.

@blelump
Member

blelump commented Feb 1, 2024

In Scenario 1, you're contradicting yourself... If the environment is highly regulated, there are no accidents. The process described in Scenario 1 is possible only if we treat entropy as allowed, whereas regulations aim to reduce entropy. In other words, there is no free-form behavior. If such an incident happens, it is not because the technology failed. The governance failed, and the technology has nothing to do with it.

To sum up, it all starts with governance. If governance fails, we enter chaos.

@carlyh-micb
Collaborator

There are always accidents, even in highly regulated environments. Look at Boeing and the recent door issue. That's why we apply a Swiss cheese model of risk and don't put all our eggs in one basket, as it were.

@pknowl
Collaborator

pknowl commented Feb 2, 2024

My scenario doesn't contradict itself; rather, it acknowledges the limitations of governance and proposes a more resilient, layered approach to risk management that combines strong governance with robust technological solutions. This hybrid approach recognizes that accidents or errors can occur even in highly regulated environments and seeks to build in safeguards at every level of the data management process. Carly's reference to the Swiss Cheese Model of risk management supports my argument by illustrating that multiple layers of defense are necessary to prevent failures. No single layer (governance, technology, etc.) is foolproof, and relying solely on governance without leveraging technological safeguards increases the risk of failure.

A hybrid approach, combining a flagging block in the capture base with the additional flagging capacity of a sensitive overlay, is preferred from a data management perspective because it ensures foundational consistency and integrity of sensitive data across all uses while also providing the flexibility to address specific or additional sensitivities as needed. This strategy leverages the strengths of both mechanisms: the capture base flagging guarantees a baseline level of data protection and regulatory compliance, crucial in highly regulated environments like clinical trials, and the sensitive overlay allows for adaptable, context-specific handling of data sensitivities. This dual-layered approach optimizes data governance by embedding essential safeguards directly into the technology, thereby reducing reliance on external governance mechanisms alone and enhancing the overall resilience and reliability of the data management system.

@pknowl
Collaborator

pknowl commented Feb 3, 2024

I spoke with Neil T. and Steven M. yesterday about an attribute-level security model. This method would focus on marking individual attribute names, such as 'birth_date,' as PII from the outset, with a clear 'DateTime' data type. We would need to deploy a library of predefined attributes for easy Capture Base ingestion. This approach would boost attribute reusability and adaptability while integrating data protection at the start of the attribute generation process, simplifying governance and PII data security. Building Capture Bases from these predefined attributes would lock in data security and simplify adherence to privacy laws while providing a robust method for protecting our information at the genesis point of attribute creation.
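
One way such a library could look, as a sketch only: `AttributeDef`, `ATTRIBUTE_LIBRARY`, and `build_capture_base` are hypothetical names, the attribute entries are made up, and the `flagged_attributes` key and `spec/capture_base/1.0` type string are assumptions about the capture base format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeDef:
    """A predefined attribute: name, data type, and whether it is PII."""
    name: str
    data_type: str
    pii: bool


# Hypothetical library of predefined attributes.
ATTRIBUTE_LIBRARY = {
    a.name: a
    for a in (
        AttributeDef("birth_date", "DateTime", pii=True),
        AttributeDef("full_name", "Text", pii=True),
        AttributeDef("heart_rate", "Numeric", pii=False),
    )
}


def build_capture_base(attribute_names):
    """Assemble a capture base from library entries.

    Raises KeyError if a name is not in the library, so ad hoc attributes
    cannot slip in without a PII decision.
    """
    defs = [ATTRIBUTE_LIBRARY[name] for name in attribute_names]
    return {
        "type": "spec/capture_base/1.0",
        "attributes": {d.name: d.data_type for d in defs},
        "flagged_attributes": [d.name for d in defs if d.pii],
    }


vital_signs = build_capture_base(["birth_date", "heart_rate"])
```

In this sketch the PII marking travels with the attribute definition, so every Capture Base built from the library flags `birth_date` the same way.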

I think it is worth discussing this approach in a new thread for further consideration.

@blelump
Member

blelump commented Feb 4, 2024

Our proposal strengthens the big picture and does not focus on particular use cases. While flexibility is a significant argument for doing so, the more foundational one is keeping the technology unbiased.

All the arguments brought into this discussion come from beliefs that are not accurate. Following their narrative, "properly shaped" technology as the last bastion of defense works only as long as people agree to operate upon agreements. In other words, it does not matter what the technology does, especially for the discussed problem of PII. If the ambient layers, a.k.a. governance, are not respected, or there is even a small expectation that they may fail, everything will fall apart. Furthermore, it does not matter whether the Swiss cheese model of risk or any other fancy model is used if the basic agreements are not respected. If foundations are weak, resiliency does not matter.

Accidents or incidents will always happen. What we can do is mitigate their consequences rather than eliminate them. Governance is the sole reason societies and ecosystems work and operate upon agreements or rules. Whether cultural or given by a law does not matter.

What the tech team proposes is fully compliant with the Swiss cheese model of risk. The audience seems to forget the DDE triangle and the layers in it. As a reminder, one of them brings reputation, which adds a basis of trust to the whole ecosystem.

Suppose the issue with the proposal is about highly regulated ecosystems, where accuracy and correctness are first-class citizens. How could it ever work if we accept the limitations of governance as a given, or do not even consider reputation?

The DDE triangle is not merely an instance of an OCA repo where people can come anytime and from anywhere, query OCA bundles and/or their Capture Base counterparts, and perform arbitrary actions with them or upon them. Such a free-for-all would contradict any highly regulated environment; the DDE triangle is the opposite.

The DDE triangle is the dance of multiple interacting layers that, as a whole, constitute lines of defense protecting the core, or the base of the triangle, that is, the integrity layer, so as to ensure that integrity is preserved. Whether we call this approach a Swiss cheese model of risk or anything else is a secondary consideration.

@blelump
Member

blelump commented Feb 11, 2024

It's worth discussing the subject of governance limitations once again. This way of thinking in particular may open unnecessary rabbit holes, namely a false perception of the foundations of governance.

While governance is a formal expression of a set of rules that may constrain behavior, actions, and/or consequences, the existence of governance is a much more fundamental human-related topic. Although we humans are the custodians of governance, it's not a given that we obey what we set. This is a matter of our beliefs; this is our morality of what we perceive as right. We may perceive governance as weak, yet our self-consciousness may suggest self-improvement, as long as we see the root cause and want to fix it. This is our morality suggesting we should act. Proper problem layering, double-checking patterns, and other mechanisms are all direct consequences of acting. Acting is a matter of ethics, which in turn is a result of morality.

While there are many potential root causes of problems, a significant one is the quality of the governance, which relies heavily on the morality of the custodians, executors, and implementors. Although there are many possible reasons, it's easy to imagine governance being left weak in order to cut expenses, for example.

Acting morally has enormous implications in all the upper layers, that is, the governance and the governance execution. While the governance interpretation is up to us humans, we decide upon our moral values and how to act.

@pknowl
Collaborator

pknowl commented Feb 12, 2024

Yes, you make a valid point. Everything starts with ethics.

Governance is the accountable framework describing "how" we implement systematic and epistemic rules in practice. It encompasses the structures, policies, and procedures that guide our actions and ideas, ensuring that we adhere to agreed-upon standards and regulations.

Ethics, conversely, represent the "why" behind our actions—the moral codes or values that guide our decision-making processes. These ethical principles motivate us to establish governance structures that comply with external regulations and align with our internal moral standards.

So, system design starts with ethics before governance, with our moral values ("why") dictating the rules of accountable method ("how"). "Ethics" sits within dynamical mechanics, emphasizing the practical application of ethical principles in shaping intelligent actions. "Governance" is a comprehensive framework of systematic and epistemic rules that integrate moral principles into its design.

Artificial Intelligence is a perfect example of converging the mechanics and governance sides of actions.

  • Ethical AI fits into "dynamical mechanics," as ethical AI focuses on "why" and "what for" — the active application of moral principles and values in developing and operating AI systems. It emphasizes the need for AI technologies to facilitate decisions and trigger morally sound actions that reflect ethical considerations, such as fairness, privacy, and non-discrimination. This dynamic aspect of constantly applying ethical principles in real-time scenarios and decision-making processes places "ethical AI" within the realm of dynamical mechanics, where the practical application of ethics shapes the behavior of AI systems.

  • Accountable AI fits into "systematic governance," as accountability in AI systems refers to the structures, policies, and procedures that ensure AI technologies are transparent and explainable, including methods to frame responsibility. Accountable AI involves creating and adhering to a framework that allows oversight, auditability, and compliance with established standards and regulations. This framework ensures that AI systems operate within agreed-upon ethical and legal boundaries, making "accountable AI" a part of systematic governance and integrating ethical principles into its design, primarily focusing on the "how" — the implementation of rules and processes to ensure AI systems are answerable and their operations are justifiable and transparent.

In essence, ethics provide the "why" and "what for" — the reasons behind actions and the outcomes we aim for, based on our values. Governance provides the "how" — the structured approach to achieving those outcomes within a set of defined rules and guidelines.

@blelump
Member

blelump commented Feb 12, 2024

While ethics significantly impacts people's morality, when we discuss the foundations, we start with ourselves, that is, individuals. Morality is much more personal; therefore, ethics is an implication/consequence of morality's existence. Most of the time, they both must follow in the same direction. Otherwise, there's a contradiction.

The distinction between morality and ethics is about consequences. The former might be influenced by ethics and other factors, but it is personal. Acting morally, therefore, is the most powerful tool in a person's hands. It's the individual choice of how to act, but all the examples mentioned above are precisely, fundamentally, about morality—being moral means to act when seeing an issue. Ethics, on the other hand, is the consequence. We can say the use case is ethical, effectively labeling it as an ethical use case or AI is ethical, labeling it ethical AI, but both are not personal. They are just tools at a person's disposal.

@mitfik mitfik added the v2 label May 21, 2024
4 participants