cellio: (avatar)
Monica ([personal profile] cellio) wrote2014-07-10 07:46 pm
Entry tags:

learning new patterns

Coming to the world of SQL databases from the world of object-oriented programming is...different. I'm starting to realize why some idioms are different, and I'm sure there are tons more that I haven't noticed yet and am probably getting wrong. But that's what learning experiences are for.

Consider, for example, a system where you have authors with associated publications. If I were designing a system to track that in, say, Java, I would define an Author class and a Publication class, with bidirectional links (Author would have a collection of Publications; Publication would have a collection of Authors (because sometimes authors collaborate)). But in a database table design you don't do that; you define a Persons table that has columns for some unique ID, name, and anything else about the person, and you have a Publications table that has columns for things about the publication like a (book) unique ID, title, publisher, genre, etc, and also the unique ID from the Persons table for the author -- and I'm not sure if multiple authors means multiple rows in the Publications table or if there's some way to do collections. But the point is that a Person doesn't know about its publications -- when you want that you'll do a JOIN between the two tables and then you'll have what you need. Connections between flavors of data are external to the data. This makes sense, but it's going to take a little getting used to.

(Y'all who are way ahead of me on this should please feel free to point out any errors in the above and save me mis-learning some things. Thanks.)
siderea: (Default)

[personal profile] siderea 2014-07-11 01:20 am (UTC)(link)
Coming to the world of SQL databases from the world of object-oriented programming is...different.

AHAHAH, you could say that.

and I'm not sure if multiple authors means multiple rows in the Publications table or if there's some way to do collections.

You make a third table, called, say, "books_x_people" (or if you're me "books_x_authors"), that has two fields: one for a book id, and one for a person id. This captures the information "this person is an author of this book", by id. This supplants an author id field on the book table (which should be removed).

Now, you JOIN all three table, saying, for instance, "show me the data on the books table for the book of this id, and the set of people ids for all records on books_x_people that have this book's book id, and their corresponding names off the people table.
Edited 2014-07-11 01:21 (UTC)

[identity profile] zare-k.livejournal.com 2014-07-11 01:43 am (UTC)(link)
"I'm not sure if multiple authors means multiple rows in the Publications table or if there's some way to do collections."

Like [livejournal.com profile] siderea described, not if the database is properly normalized for transaction processing. I usually refer to these as "map tables". It's an easy way to represent relationships between any number of entities.

Storage for analytic scenarios can be a different matter... data may be denormalized back for aggregation. Looks like I will be learning more about this soon myself.


"Coming to the world of SQL databases from the world of object-oriented programming is...different."

It is just as weird coming from the other direction :)

[identity profile] goldsquare.livejournal.com 2014-07-11 02:20 am (UTC)(link)
It is different, very.

SQL is about relationships. So you have authors and lots of them. And publications, many of those.

Then you define (or project) a relationship between them. You can create the relationships you will need most in advance and stuff it in a table, or you can create them dynamically with queries.

I hope that rather watered down description helps make it more intuitive.

[identity profile] goldsquare.livejournal.com 2014-07-11 02:36 am (UTC)(link)
I learned SQL in 1981.

I hate getting old. :-)

I did enjoy the job interview for database QA engineers where they asked me some trivia question and I answered "you think the correct answer is A and I am tempted to give it to you but is B and I can explain why if you like".

I did get the contract job.

[identity profile] goldsquare.livejournal.com 2014-07-11 02:38 am (UTC)(link)
You can get amazing performance out of SQL as long as you are well to compromise on purity of theory. :-)
siderea: (Default)

[personal profile] siderea 2014-07-11 02:46 am (UTC)(link)
I usually refer to these as "map tables".

I was sufficiently intrigued by the sheer diversity of terms for these things, that I held a poll on it. "Join table" was the clear winner, but "map", "link(|ing|y])", and "(x|cross|)ref(|erence])" all had multiple endorsements.
Edited 2014-07-11 02:47 (UTC)
siderea: (Default)

[personal profile] siderea 2014-07-11 02:49 am (UTC)(link)
I recently heard somebody recommend denormalizing data to solve some problem!

Just to make clear: denormalization is the necessary and normal output of any interesting query. Denormalization is fine in output.

Storing data denormalized... denormalizing is the devil you make deals with, see.

ETA: Wait, what do you mean by "denormalization"? In RDBMSs, it means things like "multiple authors means multiple rows in the Publications table". Your result set of "books and their authors" will look like:
The C Programming Language                        | Kernighan
The C Programming Language                        | Ritchie
Structure and Interpretation of Computer Programs | Abelson 
Structure and Interpretation of Computer Programs | Sussman

But you don't ever store it that way, except in some sort of cache, because data hygiene gets hard/impossible.
Edited 2014-07-11 03:03 (UTC)

[identity profile] zare-k.livejournal.com 2014-07-11 04:58 am (UTC)(link)
"I recently heard somebody recommend denormaliing data to solve some problem!"

Generally motivated by performance concerns. There is some cost to joining a bunch of entities back together for aggregation. Although, you will occasionally hear a blanket "joins are bad!" which is just silly... especially with current RDBMSes that have very advanced query optimizers.

[identity profile] zare-k.livejournal.com 2014-07-11 05:01 am (UTC)(link)
Interesting! I hear "join table" a lot also.
dsrtao: dsr as a LEGO minifig (Default)

[personal profile] dsrtao 2014-07-11 02:08 pm (UTC)(link)
Your schema holds the data you think is important, but in any non-trivial project, this will change repeatedly over the course of the work. An awful lot of people seem to think that it's a good idea to start by defining your schema, and then never think about it again. They are almost certainly wrong. However, if your SQL is tightly linked into your code, it will be difficult to change your schema without changing everything else.

That's why everybody* uses a toolkit. If it's a good toolkit, it does what you want long-term, including schema migrations and reversions and differentiating test environments from production. If it's a bad toolkit, it does what you want short term but breaks when you outgrow it. And if it's an awful toolkit, it doesn't do anything better than writing a library of SQL statements with placeholders and bind variables wouldn't do better.

https://en.wikipedia.org/wiki/List_of_object-relational_mapping_software is not comprehensive, but is a plausible overview.



*No, not really everybody. But most people on most projects.

[identity profile] dvarin.livejournal.com 2014-07-17 12:57 am (UTC)(link)
Huh. Possibly I spend too much time with ORM frameworks, but conceptually (data-only) objects and tables seem pretty much the same to me. They're both just collections of fields. The tricky bits happen when you try to represent object relations, which is what you're asking about, and other people have explained better.
The way I deal with it practically is that I have a little mental decision tree with about three branches based on the type, multivaluedness, and reciprocity of the object reference field, and then each leaf on the tree maps to some archetypical RDB structure that represents that reference type. It doesn't actually take that long to build this up.

goldsquare's comment that relationships in RDBland can be dynamic queries is worth emphasizing. If in objectland you have a Patient who has a list of Prescription, the Patient has a List<Prescription> multivalue objref and the Prescription has a "patient" objref. In RDBland this manifests itself as a "patient_id" foreign key column on the prescription table and nothing in the patient table. Physically, only prescriptions reference patients. Conceptually, a patient is linked to its prescriptions by the query SELECT * FROM prescription WHERE patient_id = [the patient id], which, because this is an RDB and you probably indexed that column, is reasonably fast.