I have a bit of a specialized request.
I'm reading a table of strings (specifically fixed-length, 36-character UUIDs generated via uuid.uuid4() from the standard library) from a file and creating a set out of it. Then my program is free to make whatever modifications it likes to this set.
When I go back to save this set, I'd like to be able to save only the new items. Currently I create a copy of the set as soon as I load it, and when I go back to save it, I calculate the difference and save just the difference.
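Roughly, this is what I'm doing today (a very simplified sketch; the names are made up and the real file format is different):

    # simplified sketch of the current approach; names and file layout are invented
    def load(path):
        with open(path) as f:
            live = set(line.strip() for line in f)
        snapshot = set(live)              # the copy kept around just for diffing
        return live, snapshot

    def save_new_items(path, live, snapshot):
        new_items = live - snapshot       # O(n) difference; deletions are invisible here
        with open(path, 'a') as f:
            for item in new_items:
                f.write(item + '\n')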
There are several problems with this approach so far: 1) calculating the difference via the standard set implementation is not very scalable (O(n), I presume); 2) maintaining a copy wastes memory; 3) I don't have a good solution if I delete items from the set (calculating the difference will return an empty set, but I actually need to delete things).
I was thinking of writing a specialized set implementation (I have no idea where to start on this) which tracks new items (added after initialization) in a separate space and exposes a new API call to get those values directly. This is kind of ugly and doesn't solve problem 3.
I also thought of using a hash table, but I'm storing many (> 1 million) of these sets in the same file (most of them will be empty or contain just a few values, but a few will be very large, in excess of 10,000 items). The file is already separated into tablespaces.
Does anyone have any ideas?
Thanks in advance.
Prateek
PS: Yes, I need blazing fast performance; simply pickling/unpickling won't do. Memory constraints are important but definitely secondary. Disk space constraints are not very important.
Prateek's gravatar image asked Apr 22 2007 at 03:13 in Python by Prateek

9 Answers

Yep, you do have to look at every item, so O(N) is an obvious lower bound.
(3) is easy -- the difference originalset-finalset is the set of things you have to delete.
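For example, with obvious names:

    to_delete = original_set - final_set   # was on disk, no longer in the set
    to_add    = final_set - original_set   # new since load, must be written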
Your problem is really a thinly disguised version of one of the most ancient and venerable ones in data processing -- the "master file / transaction file" problem. With sets (which are built as hash tables) you have very powerful weapons for this fray (it used to be, many years ago, that sorting and "in-step processing of sorted files" was the only feasible approach to such problems), but you still have to wield them properly.
In your shoes, I would write a class whose instances hold three sets: the master set as loaded from disk, a set of items added since then, and a set of items deleted since then.
You can implement a set-like interface; for example, your .add method will need to remove the item from the deleted set (if it was there), then add it to the added set (if it wasn't already in the master set); and so on, for each and every method you actually desire (including no doubt __contains__ for membership testing). E.g., with suitably obvious names for the three sets, we could have...:
class cleverset(object):
    ...
    def __contains__(self, item):
        if item in self.deleted:
            return False
        elif item in self.added:
            return True
        else:
            return item in self.master
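Similarly, .add and .remove might look something like this (just a sketch of the behavior described above; adapt as needed):

    def add(self, item):
        self.deleted.discard(item)          # un-delete it if it was marked deleted
        if item not in self.master:
            self.added.add(item)

    def remove(self, item):
        if item in self.added:
            self.added.remove(item)         # it never hit the disk, just forget it
        elif item in self.master and item not in self.deleted:
            self.deleted.add(item)
        else:
            raise KeyError(item)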
When the time comes to save back to disk, you'll perform deletions and additions based on self.deleted and self.added, of course.
I'm not addressing the issue of how to best represent such sets on disk because it's not obvious to me what operations you need to perform on the on-disk representation -- for example, deletions can be very costly unless you can afford to simply set a "logically deleted" flag on deleted items; but, depending on how often you delete stuff, that will eventually degrade performance due to fragmentation, and maintaining the indexing (if you need one) can be a bear.
Normally, I would use a good relational database (which has no doubt already solved on my behalf all of these issues) for purposes of persistent storage of such data -- usually sqlite for reasonably small amounts of data, PostgreSQL if I need enormous scalability and other "enterprise" features, but I hear other people are happy with many other engines for relational DBs, such as Oracle (used to be superb if you could afford full-time specialized db admins), IBM's DB2, SAP's database (now opensourced), Firebird, even MySQL (I've never, but NEVER, had any good experience with the latter, but I guess maybe it's only me, as otherwise it could hardly be so popular). The main point here is that a relational DB's tables _should_ be already very well optimized for storing "sets", since those tables ARE exactly nothing but sets of tuples (and a 1-item tuple is a tuple for all that:-).
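For instance, with sqlite, a single table of (set_id, uuid) rows gives you membership, insertion and deletion essentially for free (just a sketch; the table and column names are invented):

    import sqlite3

    conn = sqlite3.connect('sets.db')
    conn.execute("CREATE TABLE IF NOT EXISTS members ("
                 " set_id INTEGER NOT NULL,"
                 " uuid TEXT NOT NULL,"
                 " PRIMARY KEY (set_id, uuid))")

    def add_member(set_id, uuid):
        conn.execute("INSERT OR IGNORE INTO members VALUES (?, ?)", (set_id, uuid))

    def remove_member(set_id, uuid):
        conn.execute("DELETE FROM members WHERE set_id = ? AND uuid = ?", (set_id, uuid))

    def load_members(set_id):
        rows = conn.execute("SELECT uuid FROM members WHERE set_id = ?", (set_id,))
        return set(row[0] for row in rows)

    conn.commit()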
Alex
Alex Martelli's gravatar image answered Apr 22 2007 at 05:14 by Alex Martelli
This may be a silly question, but why? Why not just save the modified set, new items and old, and not mess about with complicated transactions?
After all, you say that disk space constraints are not very important. So I think you shouldn't care that you're duplicating the original items. (Although maybe I'm missing something.)
Perhaps what you should be thinking about is writing a custom pickle-like module optimized for reading/writing sets quickly.
Steven.
Steven D'Aprano's gravatar image answered Apr 22 2007 at 06:09 by Steven D'Aprano
I tried just that. Basically, I ignored all the difficulties of difference calculation and just overwrote the entire tablespace with the new set. At about 3000 entries per file (and 3 files), along with all the indexing etc., the extra I/O alone cost me 28% in performance: I got 3000 entries committed in 53s with difference calculation but 68s when writing the whole thing.
I already did this. I'm not using the pickle module at all. Since I'm guaranteed that my sets contain a variable number of fixed-length strings, I write a header at the start of each tablespace (using struct.pack) marking the number of rows, and then simply save each string one after the other without delimiters. I can do this simply by issuing "".join(list(set_in_question)) and saving the resulting string after the header. There are a few more things that I handle (such as automatic tablespace overflow).
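In simplified form it amounts to this (the real header has more fields and the real code handles tablespace overflow; this sketch assumes Python 2 byte strings and a file opened in binary mode):

    from struct import pack, unpack, calcsize

    ROW_LEN = 36                 # fixed-length uuid4() strings
    HEADER_FMT = "<I"            # made-up header: just the row count
    HEADER_SZ = calcsize(HEADER_FMT)

    def dump_set(f, s):
        f.write(pack(HEADER_FMT, len(s)))
        f.write("".join(s))      # fixed-length rows, no delimiters

    def load_set(f):
        (count,) = unpack(HEADER_FMT, f.read(HEADER_SZ))
        data = f.read(count * ROW_LEN)
        return set(data[i * ROW_LEN:(i + 1) * ROW_LEN] for i in range(count))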
Prateek
Prateek's gravatar image answered Apr 22 2007 at 10:35 by Prateek
Now why didn't I think of that?
Ok. This class sounds like a good idea. I'll try it.
I actually don't have very many transactions that I need to perform on disk. Once I've loaded these sets, I normally use the contains operation and union/intersection/difference operations. I can definitely use a tombstone on the record to mark it as logically deleted. The problem right now is that I don't need indexing within sets (I either load the whole set or don't bother at all); I'm only indexing the entire file (one entry in the index per tablespace). So even if I use tombstones, I'll still have to go through the whole tablespace to find the tombstoned entries. Alternatively, I could write the deleted data at the end of the tablespace and mark it with the tombstone (since I don't care about space so much), and when I load the data I can eliminate the tombstoned entries. I'll give that a try and report my performance results on this thread. I absolutely love that I can make a change this important in one or two days using Python!
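Something like the following is what I have in mind (the flag byte and layout here are invented for the sketch):

    TOMBSTONE = "D"
    LIVE = "L"

    def append_delete(f, uuid):
        f.seek(0, 2)                      # jump to the end of the tablespace
        f.write(TOMBSTONE + uuid)

    def append_add(f, uuid):
        f.seek(0, 2)
        f.write(LIVE + uuid)

    def load_set(f, count):
        live = set()
        for _ in range(count):            # count comes from the tablespace header
            flag = f.read(1)
            uuid = f.read(36)
            if flag == TOMBSTONE:
                live.discard(uuid)        # cancels an earlier live entry
            else:
                live.add(uuid)
        return live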
Thanks Alex, but we're actually implementing a (non-relational) database engine. This particular file is one of many files (all kept in sync by the engine) that I need to persist (some files use other methods, including pickling, for persistence). I've already solved most of our other problems, and according to hotshot this is where a significant share of the total CPU time now goes (at 3000 items and 3 files). If you're interested, you can check it out at http://www.brainwavelive.com or email me.
Prateek
Prateek's gravatar image answered Apr 22 2007 at 10:50 by Prateek
Why are you reinventing the wheel? Why not just implement your functionality on top of an existing database as your backing store?
Aahz (aahz at pythoncraft.com) http://www.pythoncraft.com/
Help a hearing-impaired person: http://rule6.info/hearing.html
Aahz's gravatar image answered Apr 22 2007 at 16:22 by Aahz
Oh dear god. I implemented this, and overall it hurt performance by about 50%-100%. The same script (entering 3000 items) now takes between 88 and 109s (it was running in 55s earlier).
Here is the new Set implementation:

class SeaSet(set):
    __slots__ = ['master', 'added', 'deleted']

    def __init__(self, s=None):
        if s is None:
            s = []
        self.master = set(s)
        self.added = set()
        self.deleted = set()

    def add(self, l):
        if l not in self.master:
            self.added.add(l)
        try:
            self.deleted.remove(l)
        except KeyError:
            pass

    def remove(self, l):
        try:
            self.master.remove(l)
            self.deleted.add(l)
        except KeyError:
            try:
                self.added.remove(l)
            except KeyError:
                pass

    def __contains__(self, l):
        if l in self.deleted:
            return False
        elif l in self.added:
            return True
        else:
            return l in self.master

    def __len__(self):
        return len(self.master) + len(self.added)

    def __iter__(self):
        for x in self.master:
            yield x
        for x in self.added:
            yield x

    def __or__(self, s):
        return self.union(s)

    def __and__(self, s):
        return self.intersection(s)

    def __sub__(self, s):
        return self.difference(s)
    def union(self, s):
        if isinstance(s, (set, frozenset)):
            return s | self.master | self.added
        elif isinstance(s, SeaSet):
            return self.master | self.added | s.master | s.added
        else:
            raise TypeError

    def intersection(self, s):
        # intersect against the union of master and added (master and added are disjoint)
        if isinstance(s, (set, frozenset)):
            return s & (self.master | self.added)
        elif isinstance(s, SeaSet):
            return (self.master | self.added) & (s.master | s.added)
        else:
            raise TypeError

    def difference(self, s):
        # in-place difference update: items of s are dropped and recorded in self.deleted
        # (note this returns None, so __sub__ above also returns None)
        if isinstance(s, (set, frozenset)):
            self.deleted |= (self.master & s)
            self.master -= s
            self.added -= s
        elif isinstance(s, SeaSet):
            t = (s.master | s.added)
            self.deleted |= (self.master & t)
            self.master -= t
            self.added -= t
        else:
            raise TypeError
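For reference, the intended usage is roughly this (just an illustrative sketch; the commit step is described in comments only):

    s = SeaSet(["a" * 36, "b" * 36])      # pretend these came from the tablespace
    s.add("c" * 36)
    s.remove("a" * 36)
    assert len(s) == 2                    # one original survivor plus one addition
    # at commit time: append s.added, tombstone s.deleted, then fold both into s.master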
The surprising thing is that commits *ARE* running about 50% faster (according to the time column in the hotshot profiler). But now the longest-running operations seem to be the I/O operations, which are taking 10 times longer (even if they're only reading or writing a few bytes). Could this have something to do with the set implementation being in Python as opposed to C?
For instance, this method:

    def __readTableHeader(self, f):
        hdr = f.read(sz__TABLE_HEADER_FORMAT__)
        if len(hdr) < sz__TABLE_HEADER_FORMAT__:
            raise EOFError
        t = THF_U(hdr)
        #t = unpack(__TABLE_HEADER_FORMAT__, hdr)
        return t
is now taking > 13s when it was taking less than 0.8s before! (same number of calls, nothing changed except the set implementation)
sz__TABLE_HEADER_FORMAT__ is a constant = struct.calcsize("
Prateek's gravatar image answered Apr 23 2007 at 05:17 by Prateek
On Mon, 23 Apr 2007 02:17:49 -0300, Prateek wrote:
[...]
Hard to tell - you have posted only your SeaSet implementation, and no I/O is involved in that code.
I don't see where your SeaSet class is used.
Gabriel Genellina
Gabriel Genellina's gravatar image answered Apr 23 2007 at 07:08 by Gabriel Genellina
Actually, that is the point. According to the hotshot profile, the problem code doesn't use the SeaSet implementation. Yet that same code was running much faster earlier. I tried this multiple times (2-3 runs).
Prateek's gravatar image answered Apr 23 2007 at 09:24 by Prateek
[Alex Martelli]
FWIW, I've posted a trial implementation in the Python cookbook: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/511496
Raymond Hettinger
Raymond Hettinger's gravatar image answered Apr 27 2007 at 05:24 by Raymond Hettinger