Primary keys¶
What is a primary key?¶
The primary key, also known as a PK, is a designated attribute that can reliably identify and distinguish each record in a table. With bframe, core data models have PKs that are a combination of the `id`, `version`, `org_id` and `env_id` fields. A single `id` can appear multiple times as long as each occurrence has a different `version`; if the `version` is the same, the row is rejected. The `id` is a surrogate key and the `version` is an incrementing count that specifies the iteration of the row. The `org_id` references the organization and the `env_id` references the environment.
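The composite-key behavior described above can be sketched with an in-memory SQLite table. This is illustrative only: the table layout mirrors the fields named above, but bframe's actual schema and data store may differ.

```python
import sqlite3

# Illustrative table with the four-column composite PK described above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        id INTEGER,
        version INTEGER DEFAULT 0,
        org_id INTEGER NOT NULL,
        env_id INTEGER NOT NULL,
        name TEXT,
        PRIMARY KEY (id, version, org_id, env_id)
    )
""")

# The same id may appear with different versions.
conn.execute("INSERT INTO customers VALUES (1, 0, 10, 20, 'Acme')")
conn.execute("INSERT INTO customers VALUES (1, 1, 10, 20, 'Acme Corp')")

# Re-submitting the same (id, version, org_id, env_id) is rejected.
try:
    conn.execute("INSERT INTO customers VALUES (1, 1, 10, 20, 'Duplicate')")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

Only the first two inserts land; the duplicate combination raises an `IntegrityError` instead of silently creating a second copy of the row.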
Why is a primary key important?¶
In general, a primary key's importance lies in its ability to uniquely identify a specific row of data. A unique identifier lets an application refer to a row unambiguously and ensures that other rows aren't confused with it. Within bframe, the PK is used to maintain the integrity of the data. For example, suppose a row with a new version is accidentally submitted twice. If the database allowed both rows to be inserted, it could cause unintended consequences: subsequent reads of the data could pull two identical objects instead of a single unique one. Depending on the contents of the table, this could lead to inaccurate rating, metering or nonsensical provisioning. Data quality is a never-ending challenge and PKs are a great tool for keeping data hygiene intact.
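The double-counting risk from an accidental double-submit can be shown with a small sketch. The records and field names here are hypothetical; the point is only that summing over duplicated rows inflates the total, while keeping one row per key combination restores it.

```python
# Hypothetical usage records keyed by (id, version); the repeated
# row models the accidental double-submit described above.
records = [
    {"id": 1, "version": 0, "quantity": 100},
    {"id": 1, "version": 0, "quantity": 100},  # accidental duplicate
    {"id": 2, "version": 0, "quantity": 50},
]

# Naive metering double-counts the duplicated row.
naive_total = sum(r["quantity"] for r in records)

# Keeping one row per (id, version) restores the correct total.
deduped = {(r["id"], r["version"]): r for r in records}
correct_total = sum(r["quantity"] for r in deduped.values())
```

Here `naive_total` comes out 100 higher than `correct_total`, which is exactly the kind of inaccurate metering a PK constraint prevents at the source.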
How are primary keys modeled in bframe?¶
The implementation strategy differs depending on the underlying data store. For transactional RDBMSs, like Postgres or MySQL, it is straightforward to add a PK constraint to the specified table.
CREATE TABLE IF NOT EXISTS customers (
id BIGINT,
org_id INTEGER NOT NULL,
env_id INTEGER NOT NULL,
...
version INTEGER DEFAULT 0,
PRIMARY KEY (id, version, org_id, env_id)
);
If this option is available it should be taken; otherwise the primary key will be difficult to fully enforce. In that scenario the best we can do is validate the data before and after insertion. Additionally, regular data quality checks are essential if the data store in use does not support uniqueness constraints. A simple check could look something like this:
SELECT count(*)
FROM (
SELECT raw.*, ROW_NUMBER() OVER(
PARTITION BY raw.id, raw.version, raw.org_id, raw.env_id
ORDER BY raw.id
) AS rank
FROM src.customers AS raw
) AS ranked
WHERE rank > 1;
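The pre-insertion half of that validation can be sketched as a small helper that rejects a batch before it reaches the store. The function name, row shape, and the idea of passing in the already-present keys are all assumptions for illustration, not bframe APIs.

```python
def validate_batch(rows, existing_keys):
    """Reject a batch containing any (id, version, org_id, env_id)
    that collides with existing rows or with the batch itself.

    rows          -- list of dicts with the four PK fields (illustrative shape)
    existing_keys -- iterable of 4-tuples already present in the store
    """
    seen = set(existing_keys)
    for row in rows:
        key = (row["id"], row["version"], row["org_id"], row["env_id"])
        if key in seen:
            raise ValueError(f"duplicate primary key: {key}")
        seen.add(key)
```

A check like this narrows the window for duplicates but cannot close it: two concurrent writers can both pass validation, which is why the post-insertion quality check above remains necessary.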
With this said, bframe does perform deduplication on the raw data to enable editing and branching, so even if the same `id` and `version` combination is submitted multiple times, only one row will be returned. This allows bframe to offer uniqueness for core models, albeit with some risk, without the guarantees of a transactional RDBMS.
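That read-side deduplication can be sketched in the same spirit as the quality check above: a window function numbers the rows within each key combination and only the first survives. The table, data, and two-column key here are simplified for illustration; this is not bframe's actual implementation.

```python
import sqlite3

# Simplified table without a PK constraint, so duplicates can land.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, version INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    (1, 0, "Acme"),
    (1, 0, "Acme"),       # accidental duplicate submission
    (1, 1, "Acme Corp"),  # legitimate new version of the same id
])

# rn = 1 keeps exactly one row per (id, version) partition on read.
rows = conn.execute("""
    SELECT id, version, name FROM (
        SELECT c.*, ROW_NUMBER() OVER (PARTITION BY id, version) AS rn
        FROM customers AS c
    ) AS ranked
    WHERE rn = 1
""").fetchall()
```

The duplicate submission is filtered out on read, while both genuine versions of `id` 1 are returned; the risk is that, unlike a PK constraint, nothing stops conflicting payloads from sharing a key in the raw data.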