I have a bit of a specialized request.
I'm reading a table of strings (specifically fixed length 36 char uuids generated via uuid.uuid4() in the standard library) from a file and creating a set out of it. Then my program is free to make whatever modifications to this set.
When I go back to save this set, I'd like to be able to only save the new items. Currently I am creating a copy of the set as soon as I load it and then when I go back to save it, i'm calculating the difference and saving just the difference.
There are many problems with this approach so far: 1) Calculating the difference via the standard set implementation is not very scalable -> O(n) I presume 2) Maintaining a copy wastes memory 3) I don't have a good solution if I delete items from the set (calculating the difference will return an empty set but I need to actually delete stuff).
I was thinking of writing a specialized set implementation (I have no idea how to start on this) which tracks new items (added after initialization) in a separate space and exposes a new API call which would enable me to directly get those values. This is kind of ugly and doesn't solve problem 3.
I also thought of using a hastable but I'm storing many ( > 1 million) of these sets in the same file (most of them will be empty or contain just a few values but a few of them will be very large - excess of 10,000 items). The file is already separated into tablespaces.
Does anyone have any ideas?
Thanks in advance. Prateek PS: Yes - I need blazing fast performance - simply pickling/unpickling won't do. Memory constraints are important but definitely secondary. Disk space constraints are not very important.
Prateek's gravatar image asked Apr 22 2007 at 03:13 in Python by Prateek

9 Answers

Yep, you do have to look at every item, so O(N) is an obvious lower bound.
(3) is easy -- the difference originalset-finalset is the set of things you have to delete.
Your problem is really a thinly disguised version of one of the most ancient and venerable ones in data processing -- the "master file / transaction file" problem. With sets (which are built as hash tables) you have very powerful weapons for this fray (it used to be, many years ago, that sorting and "in-step processing of sorted files" was the only feasible approach to such problems), but you still have to wield them properly.
In your shoes, I would write a class whose instances hold three sets:
You can implement a set-like interface; for example, your .add method will need to remove the item from the deleted set (if it was there), then add it to the added set (if it wasn't already in the master set); and so on, for each and every method you actually desire (including no doubt __contains__ for membership testing). E.g., with suitably obvious names for the three sets, we could have...:
class cleverset(object): ... def __contains__(self, item): if item in self.deleted: return False elif item in self.added: return True else: return item in self.master
When the time comes to save back to disk, you'll perform deletions and additions based on self.deleted and self.added, of course.
I'm not addressing the issue of how to best represent such sets on disk because it's not obvious to me what operations you need to perform on the on-disk representation -- for example, deletions can be very costly unless you can afford to simply set a "logically deleted" flag on deleted items; but, depending on how often you delete stuff, that will eventually degrade performance due to fragmentation, and maintaining the indexing (if you need one) can be a bear.
Normally, I would use a good relational database (which has no doubt already solved on my behalf all of these issues) for purpose of persistent storage of such data -- usually sqlite for reasonably small amounts of data, PostgreSQL if I need enormous scalability and other "enterprise" features, but I hear other people are happy with many other engines for relational DBs, such as Oracle (used to be superb if you could afford full-time specialized db admins), IBM's DB2, SAP's database (now opensourced), Firebird, even MySQL (I've never, but NEVER, had any good experience with the latter, but I guess maybe it's only me, as otherwise it could hardly be so popular). The main point here is that a relational DB's tables _should_ be already very well optimized for storing "sets", since those table ARE exactly nothing but sets of tuples (and a 1-item tuple is a tuple for all that:-).
Alex
Alex Martelli's gravatar image answered Apr 22 2007 at 05:14 by Alex Martelli
This may be a silly question, but why? Why not just save the modified set, new items and old, and not mess about with complicated transactions?
After all, you say:
Since disk space is not important, I think that you shouldn't care that you're duplicating the original items. (Although maybe I'm missing something.)
Perhaps what you should be thinking about is writing a custom pickle-like module optimized for reading/writing sets quickly.
Steven.
Steven D'Aprano's gravatar image answered Apr 22 2007 at 06:09 by Steven D'Aprano
I tried just that. Basically ignored all the difficulties of difference calculation and just overwrote the entire tablespace with the new set. At about 3000 entries per file (and 3 files) along with all the indexing etc. etc. just the extra I/O cost me 28% performance. I got 3000 entries committed in 53s with difference calculation but in 68s with writing the whole thing.
I already did this. I'm not using the pickle module at all - Since I'm guaranteed that my sets contain a variable number of fixed length strings, I write a header at the start of each tablespace (using struct.pack) marking the number of rows and then simply save each string one after the other without delimiters. I can do this simply by issuing "".join(list(set_in_question)) and then saving the string after the header. There are a few more things that I handle (such as automatic tablespace overflow)
Prateek
Prateek's gravatar image answered Apr 22 2007 at 10:35 by Prateek
Now why didn't I think of that?
Ok. This class sounds like a good idea. I'll try it.
I actually don't have very many transactions I need to perform on disk. Once I've loaded these sets, I normally use the contains operation and union/intersection/difference operations. I can definitely use a tombstone on the record to mark it as logically deleted. The problem right now is that I don't need indexing within sets (I either load the whole set or don't bother at all) - I'm only indexing the entire file (one entry in the index per tablespace). So even if I use tombstones, I'll have to still go through all the tablespace to find the tombstoned files. Alternatively, I could write the deleted data at the end of the tablespace and mark it with the tombstone (since I don't care about space so much) and when I load the data, I can eliminate the tombstoned entries. I'm give that a try and report my performance results on this thread - I absolutely love that I can make this important a change in one or two days using Python!
Thanks Alex, but we're actually implementing a (non-relational) database engine. This particular file is one of many files (which are all kept in sync by the engine) I need to persist (some files use other methods - including pickling - for persistence). I've already solved most of our other problems and according to hotshot, this is
total CPU time @ 3000 items and 3 files). If you're interested, you can check it out at http://www.brainwavelive.com or email me.
Prateek
Prateek's gravatar image answered Apr 22 2007 at 10:50 by Prateek
In article ,
Why are you reinventing the wheel? Why not just implement your functionality on top of an existing database as your backing store?
Aahz (aahz at pythoncraft.com) http://www.pythoncraft.com/
Help a hearing-impaired person: http://rule6.info/hearing.html
Aahz's gravatar image answered Apr 22 2007 at 16:22 by Aahz
Oh dear god, I implemented this and it overall killed performance by about 50% - 100%. The same script (entering 3000 items) takes between 88 - 109s (it was running in 55s earlier).
Here is the new Set implementation: class SeaSet(set): __slots__ = ['master', 'added', 'deleted'] def __init__(self, s=None): if s is None: s = [] self.master = set(s) self.added = set() self.deleted = set()
def add(self, l): if l not in self.master: self.added.add(l) try: self.deleted.remove(l) except KeyError: pass
def remove(self, l): try: self.master.remove(l) self.deleted.add(l) except KeyError: try: self.added.remove(l) except KeyError: pass
def __contains__(self, l): if l in self.deleted: return False elif l in self.added: return True else: return l in self.master
def __len__(self): return len(self.master) + len(self.added)
def __iter__(self): for x in self.master: yield x for x in self.added: yield x
def __or__(self, s): return self.union(s)
def __and__(self, s): return self.intersection(s)
def __sub__(self, s): return self.difference(s)
def union(self, s): """docstring for union""" if isinstance(s, (set, frozenset)): return s | self.master | self.added elif isinstance(s, SeaSet): return self.master | self.added | s.master | s.added else: raise TypeError
def intersection(self, s): """docstring for intersection""" if isinstance(s, (set, frozenset)): return s & self.master & self.added elif isinstance(s, SeaSet): return self.master & self.added & s.master & s.added else: raise TypeError
def difference(self, s): """docstring for difference""" if isinstance(s, (set, frozenset)): self.deleted |= (self.master - s) self.master -= s self.added -= s elif isinstance(s, SeaSet): t = (s.master | s.deleted) self.deleted |= self.master - t self.master -= t self.added -= t else: raise TypeError
The surprising thing is that commits *ARE* running about 50% faster (according to the time column in the hotshot profiler). But, now, the longest running operations seem to be the I/O operations which are taking 10 times longer! (even if they're only reading or writing a few bytes. Could this have something to do with the set implementation being in Python as opposed to C?
For instance, this method: def __readTableHeader(self, f): hdr = f.read(sz__TABLE_HEADER_FORMAT__) if len(hdr) < sz__TABLE_HEADER_FORMAT__: raise EOFError t = THF_U(hdr) #t = unpack(__TABLE_HEADER_FORMAT__, hdr) return t
is now taking > 13s when it was taking less than 0.8s before! (same number of calls, nothing changed except the set implementation)
sz__TABLE_HEADER_FORMAT__ is a constant = struct.calcsize("
Prateek's gravatar image answered Apr 23 2007 at 05:17 by Prateek
En Mon, 23 Apr 2007 02:17:49 -0300, Prateek escribi?:
[...]
Hard to tell - you have posted only your SeaSet implementation, and no I/O is involved in that code.
I don't see where your SeaSet class is used.
Gabriel Genellina
Gabriel Genellina's gravatar image answered Apr 23 2007 at 07:08 by Gabriel Genellina
Actually that is the point. According to the hotshot profile, the problem code doesn't use the SeaSet implementation. Yet that same code was running much faster earlier. I tried multiple time (2-3 times).
Prateek's gravatar image answered Apr 23 2007 at 09:24 by Prateek
[Alex Martelli]
FWIW, I've posted a trial implementation in the Python cookbook: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/511496
Raymond Hettinger
Raymond Hettinger's gravatar image answered Apr 27 2007 at 05:24 by Raymond Hettinger

Related Discussions

  • Type-checking Support In Python? in Python

  • I'm just re-learning Python after nearly a decade away. I've learned a good healthy paranoia about my code in that time, and so one thing I'd like to add to my Python habits is a way to both (a) make intended types clear to the human reader of the code, in a uniform manner; and (b) catch any type errors as early and automatically as possible. I found the "typecheck" module (http://oakwinter...

  • Histogram Type Thingy For (unique) Dict Items in Python

  • hi. I've been banging my head against this one a while and have asked around, and i am throwing this one out there in the hopes that some one can shed some light on what has turned out to be a tough problem for me (though i am getting closer). i have been mucking with a lot of data in a dictionary that looks like: events = { (2, 1, 0) : [8, 3, 5, 4], (2, 7, 0) : [4, 3, 2, 2], (2, 14, ...

  • Problem: New Type In Python 2.2 in Python

  • In-Reply-Message-ID: [Dave Kuhlman] Cheer up, it makes no sense at all . This is usually a symptom of a function in an extension module setting an error condition but neglecting to return NULL (or whatever the proper error return is in context). A common way for that to happen is for an extension to neglect to check the return of an API function *it* calls for errors (every call must be checked...

  • Problem: New Type In Python 2.2 in Python

  • I've implemented a new Python type (in C). It's based on Objects/xxobject.c in the source code distribution. It implements a wrapper for a compiled XSLT stylesheet and is part of the Python support for libxslt. (See http://www.rexx.com/~dkuhlman) It no longer works with Python 2.2. Here is the finger-print of the error: o o o File "test_to_file_compiled.py", line 28, in translate...

  • Defining A New Base-type In Python Based Off Hexavigesimal in Python

  • Good evening, I have defined a new numbering structure for certain mathematical advantages. How do I implement this in Python, or would I be better off writing this in C or C++? Ultra concise definition: http://i42.tinypic.com/af7w4h.png LaTeX source: http://pastebin.tlhiv.org/Kf6jPRkI Thanks for all suggestions, Alec Taylor...

  • Set Type For Datetime Intervals in Python

  • Hello, I need to compare sets of datetime intervals, and make set operations on them: intersect, union, difference etc. One element of a set would be an interval like this: element ::= (start_point_in_time, end_point_in_time) intervalset ::= { element1, element2, .... } Operations on elements: element1.intersect(element2) element1.union(element2) element1.minus(element2) Operations on sets: ...

  • Set Type? in Python

  • How do you check to see if a variable is a set? I would like to use if type(var) is types.SetType: blah but that is not available in types module. I am using 2.4...

  • Type Converion In Python in Python

  • Hi, Thanks everyone for your inputs. I actually wasn't thinking about writing it from scratch. :) I'm still new to this python language, so just didn't realize there's a struct module with pack/unpack to do the job for me. I've solved my problem with this struct module! Thanks again for your help. Winnie :)...

  • Type Converion In Python in Python

  • Hi, I'm writing a socket server in python and I'm having some trouble converting the data received. Another socket client, which is written in C, sends data to a particular ipaddress and port number and my job is to parse the data received. Now the problem is, the data sent from the client is an array of uint2. If the server is written also in C/C++, then I can cast the char* to uint2 * and...

  • Type Checking In Python? in Python

  • It would be nice if I could do something like: def my_func(int foo, list bar, Quux quux): ... rather than def my_func(foo, bar, quux): if not isinstance(quux, Quux): raise RuntimeError("param 'quux' must be of class 'Quux'") Is there any plans on doing this for Python 3000? Or will something like this never be a part of Python?...

  • Type Checking In Python? in Python

  • It would be nice if I could do something like: def my_func(int foo, list bar, Quux quux): ... rather than def my_func(foo, bar, quux): if not isinstance(quux, Quux): raise RuntimeError("param 'quux' must be of class 'Quux'") Is there any plans on doing this for Python 3000? Or will something like this never be a part of Python?...

  • Type Checking In Python? in Python

  • Most list-like Python objects do not derive from UserList. We could argue whether they should or should not but the point is that your syntactic change would only provide maximum benefit if Python programmers change their programming style. That's why it is not something that can be implemented overnight. Paul Prescod - Not encumbered by corporate consensus It's difficult to extract sense from ...

  • Type Checking In Python? in Python

  • There is a whole special interest group with dozens of members set up to discuss this issue. Nevertheless, its more complicated than you can imagine and progress is slow. Paul Prescod - Not encumbered by corporate consensus It's difficult to extract sense from strings, but they're the only communication coin we can count on. - http://www.cs.yale.edu/~perlis-alan/quotes.html...

  • Type Checking In Python? in Python

  • Let me also point out that in many cases, this style of programming would be frowned upon by serious Python programmers. For instance, if you check that something is a string, your code will complain when it is handed a Unicode string, even though it would probably work fine. If you check that it is an open file, then your code probably will complain about stringIO file-like objects, even though it...

  • Type Checking In Python? in Python

  • In article , Paul Prescod That's another reason why I'd like the type checking to be done by the Python compiler, because it could be intelligent about those things. Like for "list" it could accept a tuple, a list, or anything derived from UserList....

  • Type Checking In Python? in Python

  • * Paul Prescod menulis: | > | > def my_func(foo, bar, quux): | > if not isinstance(quux, Quux): | > raise RuntimeError("param 'quux' must be of class 'Quux'") | | Let me also point out that in many cases, this style of programming | would be frowned upon by serious Python programmers. For instance, if | you check that something is a string, your code will complain when it is | handed...

  • Type Checking In Python? in Python

  • In article , Matthew says... That's plain C ... if not quux.isinstance(Quux): is more flexible ... so you have just to implement that method in your class definition Regards Armin...

  • Adding A 'struct' Into New Python Type in Python

  • Hi, I'm following this example : http://nedbatchelder.com/text/whirlext.html#h_making_a_type and trying to add new data into 'CountDict' type Adding a simple 'char' works well. typedef struct { PyObject_HEAD PyObject * dict; int count; char c; //add this and placed an entry into PyMemberDef as T_CHAR. } CountDict; I can access 'c' from python code,no issues so far. Now I want to...

  • Problem Using New Python Type: "cannot Add Type To String" in Python

  • In-Reply-Message-ID: If the function is testing for presence of a PyStringObject then you have to pass it a PyStringObject. Call PyObject_Str() or PyObject_Repr() on your object then use Py_BuildValue() to build your argument tuple. Don't forget to decrement the refcounts when you're done. Ignacio Vazquez-Abrams...

  • Problem Using New Python Type: "cannot Add Type To String" in Python

  • Hi, I created a new Python type (extension in C), and tried to call a function with an object of the new type as the argument: PyObject *arglist = PyTuple_New(1); PyTuple_SetItem(arglist, 0, myObj); /* myObj is an object of the new type */ PyObject_CallObject(funcObj, arglist);... However, the PyObject_CallObject() failed with the error message saying that: "cannot add type to string". It seems...