Historization with CleanerVersion¶
Disclaimer: This documentation as well as the CleanerVersion application code have been written to work against Django 1.9.x through 1.11.x. The documentation may not be accurate anymore when using more recent versions of Django.
Quick Start¶
Installation¶
If you prefer not to work with the sources directly, you can install the CleanerVersion package from PyPI (you may need superuser privileges, as with any other pip installation):
pip install cleanerversion
Afterwards, add versions to your INSTALLED_APPS:
INSTALLED_APPS = (
...
'versions',
...
)
If you want to check whether your components are compatible with CleanerVersion, you can run the unit tests that ship with CleanerVersion. To do so, register the CleanerVersion test app in the INSTALLED_APPS variable of your Django project by adding the versions_tests keyword as follows:
INSTALLED_APPS = (
...
'versions_tests',
...
)
Now, to check whether things work correctly, run CleanerVersion’s unit tests from within your Django project root:
python manage.py test versions_tests
If this terminates with an OK, you’re all set. Go on and create your models as follows.
Keep in mind that you are not required to keep versions_tests in your INSTALLED_APPS setting; it will only create unnecessary tables every time you sync with your DB. So, you can safely remove it after having run the test suite.
A simple versionable model¶
First, import all the necessary modules. In this example, all the imports are done at the beginning, so that this is a working example if placed in a single source file. Here’s how:
from datetime import datetime
from django.db.models.fields import CharField
from django.utils.timezone import utc
from versions.models import Versionable
class Person(Versionable):
    name = CharField(max_length=200)
    address = CharField(max_length=200)
    phone = CharField(max_length=200)
Assuming you know how to deal with Django models (you will need to migrate your DB before your code becomes usable; if you’re only running tests, Django does that step for you), the next step is to use your model to create some entries:
p = Person.objects.create(name='Donald Fauntleroy Duck', address='Duckburg', phone='123456')
t1 = datetime.utcnow().replace(tzinfo=utc)
p = p.clone() # Important! Fetch the returned object, it's the current one! Continue work with this one.
p.address = 'Entenhausen'
p.save()
t2 = datetime.utcnow().replace(tzinfo=utc)
p = p.clone()
p.phone = '987654'
p.save()
t3 = datetime.utcnow().replace(tzinfo=utc)
Now, let’s query the entries:
donald_current = Person.objects.as_of().get(name__startswith='Donald') # Get the current entry
print(donald_current.address) # Prints 'Entenhausen'
print(donald_current.phone) # Prints '987654'
donald_t1 = Person.objects.as_of(t1).get(name__startswith='Donald') # Get a historic entry
print(donald_t1.address) # Prints 'Duckburg'
print(donald_t1.phone) # Prints '123456'
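The timestamps t1 to t3 above are timezone-aware UTC datetimes. On Python 3, the standard library alone can produce the same kind of value. The following is a minimal sketch, not part of CleanerVersion; the helper name utc_now is hypothetical:

```python
from datetime import datetime, timezone


def utc_now():
    """Return the current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)


t1 = utc_now()

# An aware datetime carries its UTC offset, so it can be compared
# safely with the version_*_date values stored for each version.
print(t1.tzinfo is timezone.utc)  # True
```

A value created this way could be passed to as_of() in the same way as t1 in the example above.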
Slowly Changing Dimensions - Type 2¶
Find the basics of slowly changing dimensions - type 2 and other types at Wikipedia. These concepts were taken over and extended to cover different types of relationships.
The technical details and assumptions are documented in the following sections.
Historization of a single entity¶
The definition of Versionable fields is as follows:
- id: The virtual ID of an entry. This field also serves as the primary key (pk) and is randomly generated.
- identity: Identifies an object over all its versions, i.e. identity does not change from one version to another.
- version_birth_date: The timestamp at which an object was created. All versions of an object will have the same creation date.
- version_start_date: The timestamp at which a version was created.
- version_end_date: The timestamp at which a version was cloned. If a version has not been cloned yet, version_end_date will be set to None (or NULL) and the entry is considered the most recent entry of an object (i.e. it is the object’s current version).
Let’s assume the following class definition for this hands-on:
class Item(Versionable):
    name = CharField(max_length=200)  # referred to as the payload data
    version = CharField(max_length=200)  # part of the payload data as well; added for more transparency
Having the class, let’s create an instance of it:
item = Item.objects.create(name="Peter Muster", version="1")
This command generated the following DB entry in the table associated with Item (inheriting from Versionable):
id (pk) | identity | version_birth_date | version_start_date | version_end_date | name | version |
---|---|---|---|---|---|---|
123 | 123 | 2014-08-14 14:43:00 | 2014-08-14 14:43:00 | None | Peter Muster | 1 |
Once you wish to change some value on your object, do it as follows:
item = item.clone()
item.name = "Peter Mauser"
item.version = "2"
item.save()
In the first line, we create the new version of the item entry and assign it immediately to the same variable we used to work with.
On the new version, we can now change the payload data at will and save() the object once we’re done.
On a DB level, things will look as follows:
id (pk) | identity | version_birth_date | version_start_date | version_end_date | name | version |
---|---|---|---|---|---|---|
123 | 123 | 2014-08-14 14:43:00 | 2014-08-14 15:09:00 | None | Peter Mauser | 2 |
124 | 123 | 2014-08-14 14:43:00 | 2014-08-14 14:43:00 | 2014-08-14 15:09:00 | Peter Muster | 1 |
Notice that the primary key of the current entry did not change. The original id will always point to the current version of an object.
Revisions of an object (i.e. historic versions) are copies of the current entry as it was at the time given by the revision’s version_end_date.
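The bookkeeping described above can be imitated in plain Python. This is only an illustrative sketch and not CleanerVersion's implementation: the Row class and the clone function are hypothetical, and integer ids stand in for the UUIDs used by the library.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import List, Optional


@dataclass
class Row:
    id: int
    identity: int
    version_birth_date: datetime
    version_start_date: datetime
    version_end_date: Optional[datetime]
    name: str
    version: str


def clone(table: List[Row], current: Row, now: datetime, new_id: int) -> Row:
    """Archive the current state under new_id and keep current.id alive."""
    # The historic copy gets a fresh id and is closed at the clone time.
    table.append(replace(current, id=new_id, version_end_date=now))
    # The current row keeps its id; only its validity window restarts.
    current.version_start_date = now
    return current


created = datetime(2014, 8, 14, 14, 43)
item = Row(123, 123, created, created, None, "Peter Muster", "1")
table = [item]

item = clone(table, item, datetime(2014, 8, 14, 15, 9), new_id=124)
item.name, item.version = "Peter Mauser", "2"

for row in table:
    print(row.id, row.identity, row.version_end_date, row.name)
# 123 123 None Peter Mauser
# 124 123 2014-08-14 15:09:00 Peter Muster
```

The printed rows correspond to the two-row table shown earlier: the row with id 123 is still the current one, and the historic copy lives under a new id.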
To make things clearer, we create another version:
item = item.clone()
item.name = "Petra Mauser"
item.version = "3"
item.save()
Once again, the situation on DB level will present itself as follows:
id (pk) | identity | version_birth_date | version_start_date | version_end_date | name | version |
---|---|---|---|---|---|---|
123 | 123 | 2014-08-14 14:43:00 | 2014-08-14 15:21:00 | None | Petra Mauser | 3 |
124 | 123 | 2014-08-14 14:43:00 | 2014-08-14 14:43:00 | 2014-08-14 15:09:00 | Peter Muster | 1 |
125 | 123 | 2014-08-14 14:43:00 | 2014-08-14 15:09:00 | 2014-08-14 15:21:00 | Peter Mauser | 2 |
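On a timeline, the state can be represented as follows: version 1 was valid from 14:43 until 15:09, version 2 from 15:09 until 15:21, and version 3 has been valid since 15:21 and is the current version.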
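To make the timeline concrete, here is a small plain-Python sketch of a point-in-time lookup over the three rows of the table above. It assumes that a version is valid from its version_start_date (inclusive) up to its version_end_date (exclusive); that boundary convention and the function name version_as_of are assumptions of this sketch, not a statement about CleanerVersion's internals.

```python
from datetime import datetime

# (id, version_start_date, version_end_date, name), taken from the table above
VERSIONS = [
    (123, datetime(2014, 8, 14, 15, 21), None, "Petra Mauser"),
    (124, datetime(2014, 8, 14, 14, 43), datetime(2014, 8, 14, 15, 9), "Peter Muster"),
    (125, datetime(2014, 8, 14, 15, 9), datetime(2014, 8, 14, 15, 21), "Peter Mauser"),
]


def version_as_of(versions, when):
    """Return the version whose validity window contains `when`, or None."""
    for version in versions:
        _, start, end, _ = version
        if start <= when and (end is None or when < end):
            return version
    return None


print(version_as_of(VERSIONS, datetime(2014, 8, 14, 15, 0))[3])   # Peter Muster
print(version_as_of(VERSIONS, datetime(2014, 8, 14, 15, 10))[3])  # Peter Mauser
print(version_as_of(VERSIONS, datetime(2014, 8, 14, 16, 0))[3])   # Petra Mauser
```

A lookup before 14:43 returns None, because no version of the object existed yet.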
Many-to-One relationships¶
Declaring versioned M2O relationship¶
Here’s an example with a sports club that can practice at most one sports discipline:
class SportsClub(Versionable):
    """Sort of an association for practicing sports"""
    name = CharField(max_length=200)
    practice_periodicity = CharField(max_length=200)
    discipline = VersionedForeignKey('Discipline')

class Discipline(Versionable):
    """A sports discipline"""
    name = CharField(max_length=200)
    rules = CharField(max_length=200)
If a many-to-one (M2O) relationship can also be unset, don’t forget to set the nullable flag (null=True) as an argument of the VersionedForeignKey field.
Adding objects to a versioned M2O relationship¶
Let’s create two disciplines and some sportsclubs practicing these disciplines:
running = Discipline.objects.create(name='Running', rules='There are none (almost)')
icehockey = Discipline.objects.create(name='Ice Hockey', rules='There\'s a ton of them')
stb = SportsClub.objects.create(name='STB', practice_periodicity='tuesday and thursday night',
discipline=running)
hcfg = SportsClub.objects.create(name='HCFG',
practice_periodicity='monday, wednesday and friday night',
discipline=icehockey)
lca = SportsClub.objects.create(name='LCA', practice_periodicity='individual',
discipline=running)
t1 = datetime.utcnow().replace(tzinfo=utc)
Reading objects from a M2O relationship¶
Now, let’s read some stuff previously loaded:
sportsclubs = SportsClub.objects.as_of(t1) # This returns all SportsClubs existing at time t1 [returned within a QuerySet]
You can also use select_related() to reduce the number of database queries made, if you know that you’ll need the ForeignKey-related objects:
# Only one database query is made for this set of statements:
hcfg = SportsClub.objects.current.select_related('discipline').get(name='HCFG')
print(hcfg.discipline.name)
Note that select_related only works for models containing foreign keys. It does not work for reverse relationships:
# This does not save any database queries! select_related() has no effect here:
icehockey = Discipline.objects.current.select_related('sportsclub_set').get(name='Ice Hockey')
print(icehockey.sportsclub_set.first().name)
This is not a CleanerVersion limitation; it’s just the way that Django’s select_related() works. Use prefetch_related() instead if you want to prefetch reverse or many-to-many relationships. Note that prefetch_related() will use at least two queries to prefetch the related objects. See also the Notes about using prefetch_related.
Filtering using objects¶
Following on the above example, let’s create a new version of the running Discipline. First, though, let’s take a look at the id, identities and foreign keys as they are now:
>> (running.id, running.identity)
(1, 1)
>> (stb.discipline_id, stb.id, stb.identity)
(1, 10, 10)
>> (lca.discipline_id, lca.id, lca.identity)
(1, 20, 20)
OK, so now we create a new version:
running = running.clone()
running.rules = "Don't run on other's feet"
running.save()
# Fetch the old version from the database:
running_at_t1 = Discipline.objects.as_of(t1).get(name='Running')
How do the id, identities, and foreign keys look at this point?
>> (running.id, running.identity)
(1, 1)
>> (running_at_t1.id, running_at_t1.identity)
(2, 1)
>> (stb.discipline_id, stb.id, stb.identity)
(1, 10, 10)
>> (lca.discipline_id, lca.id, lca.identity)
(1, 20, 20)
The objects running and running_at_t1 have different ids, but the same identity; they are different versions of the same object. The id of the old version has changed; the new version has the original id value.
Notice that stb and lca still refer to the Discipline with id 1. When they were created, at t1, they were actually pointing to a different version than the current version. Their discipline_id column was not updated to point to the old version when running was cloned. This is an important implementation detail: foreign keys point to the latest version of the foreign object, which always has its id equal to its identity. If this were not the case, it would be necessary to clone all of the objects that have a foreign key pointing to object X when object X is cloned; this would result in a very quickly growing database.
When searching for an object at a given time t1, foreign key values are matched against the related record’s identity column, and the related records are further restricted to those that are valid at t1.
All of this should help you understand that when you filter a query for a certain point in time using an object, it’s actually the identity of the object that will be used for the filtering, and not the id. You are effectively saying, “I want to limit to records that were associated with some version of this object”.
>> stb1 = SportsClub.objects.as_of(t1).filter(discipline=running, name='STB').first()
>> stb2 = SportsClub.objects.as_of(t1).filter(discipline=running_at_t1, name='STB').first()
>> (stb1.discipline.id, stb2.discipline.id)
(2, 2)
>> stb3 = SportsClub.objects.current.filter(discipline=running, name='STB').first()
>> stb4 = SportsClub.objects.current.filter(discipline=running_at_t1, name='STB').first()
>> (stb3.discipline.id, stb4.discipline.id)
(1, 1)
If you really want to filter using the id of the object, you need to explicitly use the id instead of passing the object itself:
>> stb5 = SportsClub.objects.as_of(t1).filter(discipline_id=running.id, name='STB').first()
>> stb6 = SportsClub.objects.as_of(t1).filter(discipline_id=running_at_t1.id, name='STB').first()
>> (stb5.discipline.id, stb6 is None)
(2, True)
>> stb7 = SportsClub.objects.current.filter(discipline_id=running.id, name='STB').first()
>> stb8 = SportsClub.objects.current.filter(discipline_id=running_at_t1.id, name='STB').first()
>> (stb7.discipline.id, stb8 is None)
(1, True)
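The lookup rule described in this section (match the stored foreign key value against identity, then keep only the version valid at the query time) can be sketched in plain Python. This is an illustration under simplifying assumptions, not the library's query code: timestamps are reduced to integer ticks, and DisciplineRow and resolve_foreign_key are hypothetical names.

```python
from typing import NamedTuple, Optional


class DisciplineRow(NamedTuple):
    id: int
    identity: int
    start: int            # version_start_date, simplified to an integer tick
    end: Optional[int]    # version_end_date, None for the current version
    rules: str


# State after running.clone(): id 1 is the current version, id 2 the historic copy.
DISCIPLINES = [
    DisciplineRow(1, 1, 5, None, "Don't run on other's feet"),
    DisciplineRow(2, 1, 0, 5, "There are none (almost)"),
]


def resolve_foreign_key(fk_value, rows, when):
    """Match the FK value against identity, then restrict to rows valid at `when`."""
    for row in rows:
        valid = row.start <= when and (row.end is None or when < row.end)
        if row.identity == fk_value and valid:
            return row
    return None


stb_discipline_id = 1  # never rewritten when the discipline is cloned

print(resolve_foreign_key(stb_discipline_id, DISCIPLINES, when=3).id)  # 2
print(resolve_foreign_key(stb_discipline_id, DISCIPLINES, when=7).id)  # 1
```

The same stored value 1 resolves to the historic row before the clone time and to the current row afterwards, which mirrors the (2, 2) and (1, 1) results shown above.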
Many-to-Many relationships¶
Declaring versioned M2M relationships¶
Assume a Person can be part of multiple SportsClubs:
class Person(Versionable):
    name = CharField(max_length=200)
    phone = CharField(max_length=200)
    sportsclubs = VersionedManyToManyField('SportsClub', related_name='members')

class SportsClub(Versionable):
    """Sort of an association for practicing sports"""
    name = CharField(max_length=200)
    practice_periodicity = CharField(max_length=200)
Adding objects to a versioned M2M relationship¶
Adding objects to a many-to-many relationship works just like in standard Django:
person1 = Person.objects.create(name="Hanover Fiste", phone="555-1234")
person2 = Person.objects.create(name="Gloria", phone="555-6777")
club = SportsClub.objects.create(name="Sweatshop", practice_periodicity="daily")
# This is one way to do it:
club.members.add(person1, person2)
# Another way to do it is to assign a list. This will remove any existing
# members that are not in the list, and add any members that are in the
# list, but not yet associated in the database.
club.members = [person1, person2]
Changing many-to-many relationships is only allowed when using the current version of the object:
# This would raise an Exception:
old_club = SportsClub.objects.previous_version(club)
old_club.members.add(person3)
Reading objects from a versioned M2M relationship¶
This works just like in standard Django, with the exception that you specify either that you are using the current state, or the state at a specific point in time:
# Working with the current state:
club = SportsClub.objects.current.get(name='Sweatshop')
local_members = club.members.filter(phone__startswith='555').all()
# Working with a specific point in time:
november1 = datetime(2014, 11, 1).replace(tzinfo=utc)
club = SportsClub.objects.as_of(november1).get(name='Sweatshop')
# The related objects that are retrieved were existing and related as of november1, too.
local_members = club.members.filter(phone__startswith='555').all()
# Queries can of course traverse relationships, too:
clubs_with_g_members = SportsClub.objects.current.filter(members__name__startswith='G').all()
Versioning objects being part of a versioned M2M relationship¶
Versioning an object in a ManyToMany relationship takes three steps, including the initial setup:
Setting up the situation requires at least two objects linked through an M2M relationship:
blog1.items.add(item1)
Further on, let’s clone the Item-instance:
new_item1 = item1.clone()
CleanerVersion also takes care of cloning and re-linking the relationships:
# done automatically by cleanerversion when item1.clone() was called
The records in ManyToMany intermediary tables are versioned: they have version_birth_date, version_start_date and version_end_date columns. The ForeignKey columns in ManyToMany intermediary tables store the id of the referenced records. Note that this is different from the VersionedForeignKeys in Versionable model instances, which store the identity of the referenced objects.
This is transparent in normal usage, but can be important to keep in mind when you need to write a query that directly references the ForeignKey columns.
Removing objects from a versioned M2M relationship¶
Changing many-to-many relationships is only allowed when using the current version of the object.
Deleting an object from a many-to-many relationship results in the record in the relationship table being soft-deleted. In other words, a version_end_date is set on the relationship record.
The syntax for soft-deleting is the same as the standard Django Model deletion syntax:
# Various ways to remove one or more associations:
club.members.remove(person1)
club.members.remove(person2, person3)
club.members.remove(person4.id)
club.members = []
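The effect of a soft-delete on the relationship table can be pictured with a plain-Python sketch. It is not CleanerVersion code: the Membership class and both functions are hypothetical, and timestamps are simplified to integers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Membership:
    club: str
    person: str
    version_start_date: int            # simplified integer timestamps
    version_end_date: Optional[int] = None


def remove_member(links, club, person, now):
    """Soft-delete: terminate the matching current link instead of dropping it."""
    for link in links:
        if (link.club, link.person) == (club, person) and link.version_end_date is None:
            link.version_end_date = now


def members_as_of(links, club, when):
    """Return the people linked to `club` at time `when`."""
    return [
        link.person
        for link in links
        if link.club == club
        and link.version_start_date <= when
        and (link.version_end_date is None or when < link.version_end_date)
    ]


links = [Membership("Sweatshop", "Hanover Fiste", 1), Membership("Sweatshop", "Gloria", 1)]
remove_member(links, "Sweatshop", "Gloria", now=10)

print(members_as_of(links, "Sweatshop", when=5))   # ['Hanover Fiste', 'Gloria']
print(members_as_of(links, "Sweatshop", when=12))  # ['Hanover Fiste']
print(len(links))                                  # 2
```

Both relationship records are still present after the removal; only queries for a time after the removal stop returning the terminated link.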
Deleting objects¶
You can expect delete() to behave like you are accustomed to in Django, with these differences:
Not actually deleted from the database¶
When you call delete() on a versioned object, it is not actually removed from the database. Instead, its version_end_date is changed from None to a timestamp.
The same is true for the VersionedManyToManyField entries associated with the object you call delete() on: they are terminated by setting a version_end_date.
on_delete handlers¶
on_delete handlers behave like this:
- CASCADE: The deletion is cascaded. In the CleanerVersion context, this means that the cascaded-to versions are terminated.
- SET, SET_NULL, SET_DEFAULT: The cascaded-to objects are cloned before SET, SET_NULL, or SET_DEFAULT are applied.
- DO_NOTHING: Does nothing, just like in standard Django. This has the effect of leaving a current object with a reference to a deleted object. However, if you ask the current object for its relations, it will not return the deleted object, because the deleted object does not match the current object’s query time restriction (e.g. only current objects).
- PROTECT: Behaves just like in standard Django.
Restoring previous versions¶
Previous versions can be restored like this:
restored_version = old_version.restore()
restored_version will now be the current version. This creates a new version; the old version is left untouched.
If any current version existed when this code ran, it was terminated before the restored version was created.
Be aware that relations (VersionedForeignKey, ManyToManyField, reverse foreign keys, etc.) are not restored. You will need to restore relations yourself if necessary.
If the object being restored has a non-nullable VersionedForeignKey, you will need to supply a value (object instance or pk) for this field. If you do not supply a value, a versions.ForeignKeyRequiresValueError will be raised.
Values can also be provided for other, non-ForeignKey fields at restore time.
Example:
Models:
class Team(Versionable):
    name = models.CharField(max_length=50)

class Mascot(Versionable):
    name = models.CharField(max_length=50)
    age = models.IntegerField()
    team = VersionedForeignKey(Team, null=False)
Code:
beaver = beaver_v1.restore(team=mascot_v1.team)
# You can also use an id instead of an object when providing ForeignKeys, just be
# sure to use the field.attname (usually: field name + '_id') as the parameter name:
new_team_pk = Team.objects.current.get(name='Black Stripes').pk
tiger = tiger_v4.restore(team_id=new_team_pk, age=33)
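The restore semantics described above (terminate any current version, create a new current version from the old one, apply supplied field values) can be sketched in plain Python. This is a simplified illustration, not the library's implementation: rows are dictionaries, timestamps are integers, ids are left out, and the function and sample data are hypothetical.

```python
def restore(table, old_version, now, **overrides):
    """Terminate any current version, then append a new current version."""
    for row in table:
        if row["identity"] == old_version["identity"] and row["version_end_date"] is None:
            row["version_end_date"] = now
    # The old version itself is copied, not modified.
    restored = dict(old_version, version_start_date=now, version_end_date=None)
    restored.update(overrides)
    table.append(restored)
    return restored


tiger_v1 = {"identity": 7, "version_start_date": 1, "version_end_date": 4, "age": 30}
tiger_v2 = {"identity": 7, "version_start_date": 4, "version_end_date": None, "age": 32}
table = [tiger_v1, tiger_v2]

tiger = restore(table, tiger_v1, now=9, age=33)

print(tiger["age"], tiger["version_end_date"])  # 33 None
print(tiger_v2["version_end_date"])             # 9
print(tiger_v1["version_end_date"])             # 4
```

The old version keeps its original validity window, the previously current version is closed at the restore time, and the override for age lands only on the new version.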
Deferred fields¶
It is not possible to clone or restore a version that has been fetched from the database without all of its fields, for example using one of these three equivalent statements:
club = Club.objects.current.defer(
    'phone', 'identity', 'version_start_date', 'version_end_date', 'version_birth_date'
).first()
club = Club.objects.current.only('name').first()
club = Club.objects.raw("""
    SELECT id, name FROM {} WHERE version_end_date IS NULL
""".format(Club._meta.db_table))[0]
Trying to do so will raise a ValueError. Any versioned object that needs to be cloned or restored must be fetched from the database without using defer() or only() (or raw() with only some of the model’s fields).
Unique Indexes¶
Having unique indexes on versioned models takes a bit of care. The issue is that multiple versions with the same data can exist; potentially the only difference will be in the id, version_start_date, and version_end_date columns.
For example, what if we want the name and phone_number to be unique together for current versions:
id (pk) | identity | version_birth_date | version_start_date | version_end_date | name | phone_number |
---|---|---|---|---|---|---|
123 | 123 | 2014-08-14 14:43:00 | 2014-08-14 15:21:00 | None | Petra Mauser | 555-1234 |
124 | 123 | 2014-08-14 14:43:00 | 2014-08-14 14:43:00 | 2014-08-14 15:09:00 | Peter Muster | 555-1234 |
In Postgresql, it’s possible to create a partially unique index which enforces that name and phone_number are unique together when the version_end_date is null. Other databases may have a similar capability. A helper method for creating these partially unique indexes is provided for Postgresql, see the Postgresql specific section for more detail.
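The idea of an index that is unique only among current versions can be tried out with SQLite, which also supports partial indexes and ships with Python. This is a sketch of the concept only: the table layout and index name are made up for the example, and it does not show what the Postgresql helper generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    " id TEXT PRIMARY KEY, identity TEXT, version_end_date TEXT,"
    " name TEXT, phone_number TEXT)"
)
# Uniqueness is only enforced among rows that are still current.
conn.execute(
    "CREATE UNIQUE INDEX person_current_name_phone"
    " ON person (name, phone_number) WHERE version_end_date IS NULL"
)

insert = "INSERT INTO person VALUES (?, ?, ?, ?, ?)"
# A historic and a current version may share the same payload.
conn.execute(insert, ("124", "123", "2014-08-14 15:09:00", "Petra Mauser", "555-1234"))
conn.execute(insert, ("123", "123", None, "Petra Mauser", "555-1234"))

try:
    # A second *current* row with the same name and phone number is rejected.
    conn.execute(insert, ("200", "200", None, "Petra Mauser", "555-1234"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)  # True
```

Historic rows fall outside the WHERE clause of the index, so they never conflict with each other or with the current row.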
Specifying the id of an object at creation time¶
It is possible to specify an id when creating a new object, instead of letting CleanerVersion do this for you. The id must be a unicode string representing a version 4 UUID.
Be careful if you do this! The possibility of collisions can increase greatly if not all sources that specify a UUID use sufficient entropy. See this for more details.
The database-level unique constraint on the id will prohibit a duplicate uuid from being inserted, but your application will need to be ready to handle that.
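If ids are generated outside of CleanerVersion, it can help to check them before use. The following sketch relies only on Python's standard uuid module; the helper name is_uuid4 is hypothetical and not part of the library.

```python
import uuid


def is_uuid4(value):
    """Return True if `value` is the canonical string form of a version 4 UUID."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    return parsed.version == 4 and str(parsed) == value.lower()


new_id = str(uuid.uuid4())  # uuid4() draws from the OS random source

print(is_uuid4(new_id))              # True
print(is_uuid4("not-a-uuid"))        # False
print(is_uuid4(str(uuid.uuid1())))   # False
```

A check like this only validates the format; it cannot tell whether the source that produced the UUID used enough entropy.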
Postgresql specific¶
Django creates extra indexes for CharFields that are used for like queries (e.g. WHERE foo like ‘fish%’). Since Django 1.6 (the version CleanerVersion originally targeted) did not have native database UUID fields, the UUID fields that are used for the id and identity columns of Versionable models have these extra indexes created. In fact, these fields will never be compared using the like operator. Leaving these indexes in place would create a performance penalty for inserts and updates, especially for larger tables. versions.util.postgresql has a function remove_uuid_id_like_indexes that can be used to remove these extra indexes.
For the issue of Unique Indexes, versions.util.postgresql has a function create_current_version_unique_indexes that can be used to create unique indexes. For this to work, it’s necessary to define a VERSION_UNIQUE attribute when defining the model:
class Person(Versionable):
    name = models.CharField(max_length=40)
    phone_number = models.CharField(max_length=20)
    VERSION_UNIQUE = [['name', 'phone_number']]
If there are multiple sets of columns that should be unique, use something like this:
VERSION_UNIQUE = [['field1', 'field2'], ['field3', 'field4']]
As an extra method of protection against bad data appearing, it is good to ensure that only one version of an object is current at the same time. This can be done by adding a partially unique index for the identity column. You can use versions.util.postgresql.create_current_version_unique_identity_indexes() for this.
For an example of how to transparently create the database indexes for these VERSION_UNIQUE definitions in a Django app, removing the extra like indexes created on the CharField columns, and enforcing that only one version is current at the same time, see:
- https://github.com/swisscom/cleanerversion/blob/master/versions_tests/__init__.py
- https://github.com/swisscom/cleanerversion/blob/master/versions_tests/apps.py
Note that this example is for Django >= 1.7; it makes use of the application registry that was introduced in Django 1.7.
Integrating CleanerVersion versioned models with non-versioned models¶
It is possible to combine versioned models (as described up to this point) with non-versioned models. In order to have your relationships work out correctly, make use of VersionedForeignKey as described in the following table.
For example, the table reads as follows: “If a model inheriting directly from Django’s Model points to a model inheriting from Versionable, then a VersionedForeignKey relation has to be used.”
Model defining the FK \ Model pointed to by the FK | models.Model | Versionable |
---|---|---|
models.Model | ForeignKey() | VersionedForeignKey() |
Versionable | ForeignKey() | VersionedForeignKey() |
Note that M2M-relationships have not been extended yet to work in a heterogeneous use case as described here.
VersionedAdmin admin for Django Admin¶
VersionedAdmin has three boolean fields that allow subclasses to easily control whether the shortened identity, the version end date, and the version start date show in the change view. These fields are list_display_show_identity, list_display_show_end_date, and list_display_show_start_date, and by default they are set to True.
Out of the box, VersionedAdmin allows for filtering the change view by the as_of queryset filter, and by whether the object is current.
Upgrade notes¶
CleanerVersion 2.x / Django 1.9/1.10/1.11¶
In Django 1.9, major changes to the ORM layer were introduced, which made existing versions of CleanerVersion incompatible with Django 1.9 onwards. We decided to release a separate major version to support Django 1.9 to 1.11.
CleanerVersion 1.6.0 / Django 1.8.3¶
Starting with CleanerVersion 1.6.0, Django’s UUIDField will be used for the id, identity, and VersionedForeignKey columns if the Django version is 1.8.3 or greater.
If you are upgrading from lower versions of CleanerVersion or Django, you have two choices:
- Add a setting to your project so that CleanerVersion will continue to use CharField for Versionable’s UUID fields. Add this to your project’s settings:
  VERSIONS_USE_UUIDFIELD = False
  This value defaults to True if not explicitly set when using Django >= 1.8.3.
- Convert all of the relevant database fields to the type and size that Django uses for UUID fields for the database that you are using. This may be possible using Django’s migrations, or could be done manually by altering the column type as necessary for your database type for all the id, identity, and foreign key columns of your Versionable models (don’t forget the auto-generated many-to-many tables). This is not a trivial undertaking; it will involve, for example, dropping and recreating constraints. An example of column altering syntax for PostgreSQL:
  ALTER TABLE blog_author ALTER COLUMN id type uuid USING id::uuid;
  ALTER TABLE blog_author ALTER COLUMN identity type uuid USING identity::uuid;
You must choose one or the other solution; not doing so will result in your application no longer working.
Known Issues¶
- No multi-table inheritance support. Multi-table inheritance currently does not work if the parent model has a Versionable base class. See this issue for more details.
- Creating Unique Indexes is a bit tricky for versioned database tables. A solution is provided for Postgresql (see the Postgresql specific section). Pull requests are welcome if you solve this problem for another database system.
For a more up-to-date state please check our project page.