.. highlight:: none

.. _design-data-model:

Design Concepts - Data Model
============================

The ability to conduct effective analysis across a Synapse hypergraph is highly dependent on the design and implementation of an appropriate **data model** for a given knowledge domain. The specifics of any data model (types, forms, properties) will vary based on both the knowledge domain and specific analytical needs.

A full discussion of the considerations (and potential complexities) of creating a well-designed data model is beyond the scope of this documentation. However, there are a few general principles and recommendations that should be kept in mind when developing a data model:

- **The model is an abstraction.** Analysis often involves subtle distinctions and qualifications; this is why analysis is often provided in long-form reporting, where natural language can convey variations in confidence or provide caveats surrounding conclusions. Capturing data and corresponding analysis in a formalized data model trades some of these subtleties for consistent representation and programmatic accessibility. A data model can never fully capture the richness and detail of a long-form report, but a well-designed model can capture the critical components of analytical findings in sufficient depth that an analyst only rarely needs to refer to additional long-form reporting or original sourcing for clarification.

- **The model should be self-evident.** While the model is necessarily an abstraction, it should not be abstracted to the point where the data and analysis in the model cannot stand on their own. That is, while supplemental external reports or explanatory notes may sometimes be helpful, they should not be **required** in order to understand the information in a Cortex. The model should be designed to convey the maximum amount of information possible: entities, relationships, and analytical annotations should be unambiguous, well-defined, and clearly understood. An analyst with domain knowledge but no prior exposure to the specific analytical findings should be able to look at the information represented in a Cortex and understand the analytical line of thought.

- **The model should be driven by real-world analytical need and analytical relevance.** Any model, regardless of knowledge domain, should be designed around the analytical questions that the model needs to answer. Many models are designed as academic abstractions ("how would we classify all possible exploitable vulnerabilities in software?") without consideration for the practical questions the data is intended to answer. Are some exploits theoretically possible, but never yet observed in the real world? Are some distinctions too fine-grained (or not fine-grained enough) for your analytical needs? Domain experts should have significant input into the type of data modeled, the analysis that needs to be performed on that data, and how it should be represented.

  The best models evolve through a cycle of forethought combined with real-world stress-testing. Creating a model "on the fly" with no prior consideration often leads to a narrowly focused, fragmented data model – in the immediacy of detailed analysis, analysts (or developers) may focus on the trees while missing the forest. However, even the most well-thought-out model planned in the abstract will still fall short when faced with the vagaries and inconsistencies of real-world data. Experience has shown that there are always edge cases, circumstances, or anomalies that cannot be anticipated. The most effective models are typically those that are planned carefully up front, then tested against real-world data and refined before being placed fully into production.

- **Err on the side of detail.** No data model is set in stone – in fact, a good model will both expand and evolve with analytical need. That said, changes to the model may require revising or updating existing model elements and associated analysis, and some changes are easier to effect than others. Generally speaking, it is easier to combine two things (tags, nodes, properties) that you later decide are "the same" than it is to split one large thing that you later decide should be two different things – if only because splitting may require manual review of each item to determine which category it belongs to. When a design decision is in question, more detail is often preferable until the design can be fully stress-tested.
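The combine-versus-split asymmetry above can be illustrated with a small sketch (the node identifiers and category names here are invented for illustration and are not Synapse API calls): merging two categories is a single bulk operation, while splitting one broad category requires a review decision for every item.

```python
# Hypothetical illustration of why combining model elements is cheaper
# than splitting them. Node IDs and category names are invented examples.

# Combining: two categories later judged to be "the same" can be merged
# in one bulk operation -- no per-node judgment is required.
foo_nodes = {"node1", "node2"}
bar_nodes = {"node3"}
merged = foo_nodes | bar_nodes  # every node moves; nothing to review

# Splitting: one broad category later judged to be two different things
# requires a decision for *each* node before it can be re-categorized.
broad_category = ["node1", "node2", "node3"]

def analyst_review(node):
    # Stand-in for manual review; in practice a human (or domain-specific
    # logic) decides which of the new categories each node belongs to.
    return "category_a" if node in ("node1", "node3") else "category_b"

split = {}
for node in broad_category:
    split.setdefault(analyst_review(node), set()).add(node)
```

The merge touches every node at once, while the split loops over each node and invokes a per-item judgment – which is exactly the manual review cost the bullet above warns about.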