
public library: “we are not going to export 100s of 1000s of book & media records for an open data request for just one person”

24d 12h ago by lemmy.sdf.org/u/evenwicht in glam@lemmy.cafe

cross-posted from: https://lemmy.sdf.org/post/53671520

I asked a library system to furnish their whole catalog of books, music, and movies in an open format (JSON, XML, or CSV). They refused, saying that the database is extremely large, composed of several hundred thousand bibliographic records that reference over 2 million documents. They say the database is highly dynamic and it would be obsolete by the time they export the data and likely not useful to more than one person.

So they have opted to limit everyone to using their web-based search. Is my request unreasonable? Or their response?

I’m trying to get a basic idea of the size we are talking about. I’m guessing 100,000 bibliographic records would consume roughly 100 MB uncompressed (assuming an average record does not exceed 1 kB). And since text compresses very well, a zipped JSON would be what, ~10 MB per 100k records? I believe a zip file of 900,000 bibliographies would be ~65 MB.
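To make the back-of-envelope numbers easy to check, here is a rough sketch of the arithmetic. The 1 kB average record size and the 10:1 compression ratio are the guesses from this post, not measured values, and `estimate_mb` is just a name made up for illustration.

```python
def estimate_mb(records, avg_bytes=1000, compression_ratio=10):
    """Return (raw_mb, zipped_mb) for a catalog of `records` entries.

    avg_bytes and compression_ratio are guesses, not measurements.
    """
    raw_mb = records * avg_bytes / 1e6
    return raw_mb, raw_mb / compression_ratio


for n in (100_000, 900_000):
    raw, zipped = estimate_mb(n)
    print(f"{n:,} records: ~{raw:.0f} MB raw, ~{zipped:.0f} MB zipped")
```

With those guesses, 900,000 records come out at about 900 MB raw and 90 MB zipped. The ~65 MB figure would need a compression ratio closer to 14:1, which is plausible for repetitive JSON but is still a guess. Either way it is a file that fits on a cheap USB stick.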

The library did not give precise figures but I would like to work out what level of crazy my request is. Do any libraries in the world export a dataset of 100s of 1000s of book and media titles? Because if it’s done /somewhere/, it would give a clue about the reasonableness of my request.

I’ll give a couple use cases in case anyone is wondering how direct DB access would be useful.

Use case 1:

  1. fetch a list of titles of interest, e.g. award-winners (books, scripts, actors, musicians, directors, etc), or a list of banned books, because if it’s banned somewhere maybe it piques your curiosity
  2. search the library’s DB for matches against a list

If the list is more than ~15 or so items, you’re fucked because library query forms rarely accept a list as input. And as soon as you need to specify other criteria, like works in English within a date range, it becomes increasingly unlikely that a web form can do the job.
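With a local copy of the catalog, use case 1 is a single join. The sketch below uses an in-memory sqlite database as a stand-in. The `catalog` table, its columns (`title`, `language`, `year`), the language code `eng`, and the sample rows are all hypothetical, since I have no idea what the library's real schema looks like.

```python
import sqlite3


def find_matches(conn, wanted_titles, language="eng", year_from=1980, year_to=2000):
    """Return catalog rows whose title is in wanted_titles, filtered by language and year."""
    # Load the list of interest into a temporary table so it can be joined.
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS wanted (title TEXT)")
    conn.execute("DELETE FROM wanted")
    conn.executemany("INSERT INTO wanted VALUES (?)", [(t,) for t in wanted_titles])
    return conn.execute(
        """
        SELECT c.title, c.year
        FROM catalog AS c
        JOIN wanted AS w ON lower(c.title) = lower(w.title)
        WHERE c.language = ? AND c.year BETWEEN ? AND ?
        ORDER BY c.title
        """,
        (language, year_from, year_to),
    ).fetchall()


# Hypothetical miniature catalog standing in for the library's export.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE catalog (title TEXT, language TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?)",
    [
        ("Beloved", "eng", 1987),
        ("The Handmaid's Tale", "eng", 1985),
        ("Persepolis", "fre", 2000),
        ("Maus", "eng", 1991),
    ],
)

matches = find_matches(conn, ["maus", "Beloved", "Persepolis"])
print(matches)
```

The list can be 15 titles or 15,000; the query does not change. Exact title matching is naive (real lists would need some normalisation of subtitles and articles), but that is a problem you can only start working on once you have the data.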

Use case 2: Suppose you are boycotting something or want to avoid something or someone (e.g. you want to avoid Tom Hanks because he is a sell-out with no sense of brand protection, who will act in any garbage film if it pays enough)

  1. fetch a list of titles you want to avoid (e.g. if you boycott Disney, get a list of Disney titles; or get a list of movies Tom Hanks was in)
  2. search the library’s DB for whatever you are looking for, but exclude matches against a list

Or you have a looooonng list of movies you have already seen or books you have read. Obviously you might want to exclude them from your queries.
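Exclusion is the same idea turned around. This is a minimal sketch, again with sqlite standing in for the real system; the `catalog` and `avoid` tables, the `kind` column, and the sample titles are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE catalog (title TEXT, kind TEXT)")
conn.execute("CREATE TABLE avoid (title TEXT)")
conn.executemany(
    "INSERT INTO catalog VALUES (?, ?)",
    [
        ("Forrest Gump", "film"),
        ("Alien", "film"),
        ("Big", "film"),
        ("Kind of Blue", "music"),
    ],
)
# The boycott / already-seen list, e.g. pulled from a filmography somewhere.
conn.executemany("INSERT INTO avoid VALUES (?)", [("forrest gump",), ("Big",)])


def search_excluding(conn, kind):
    """Return titles of the given kind that are NOT on the avoid list."""
    rows = conn.execute(
        """
        SELECT c.title
        FROM catalog AS c
        WHERE c.kind = ?
          AND NOT EXISTS (
              SELECT 1 FROM avoid AS a WHERE lower(a.title) = lower(c.title)
          )
        ORDER BY c.title
        """,
        (kind,),
    )
    return [title for (title,) in rows]


films = search_excluding(conn, "film")
print(films)
```

No web search form I have seen lets you paste in a list of hundreds of titles to exclude, which is the whole point.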

Use case 3: The library has an extremely limited sense of genres. A conversation went like this:

Me: “Where is the EDM section? Where is the ambient and trip-hop section?” Librarian: “what’s that?” Me: Electronic music. Librarian: those would be under “rock”. Me: What about world music, like Ravi Shankar (classical Indian)? Librarian: check jazz

Fuck me. No wonder the rock and jazz sections are so huge and there’s little else. Picking through it would be insurmountable and the web DB likely has the same sloppy genre problem. I suspect what has happened is young ppl just don’t do libraries much and they probably use Spotify or a similar online surveillance system for music. In fact I rarely even see people browsing the music these days. So the library organisation just did not keep up with genres and no one noticed because they are online. So again, like use case 1, it would be useful to find the intersection between a list of titles of interest and the library DB.

I have to wonder if the /real/ problem is that the library thinks I would be the sole user of the exported DB. I can understand resistance to doing a significant amount of work for just one person. But I would expect many people to have search needs that these GUI webforms cannot handle, no? And from there it would be the subset of those people who know SQL.

No shit?

Unless you are talking about a library in an urban area where it is heavily funded, your library may have an incredibly limited tech team. It's still possible that they are running all vendor software with a team of mostly contractors or shared with the rest of the region.

Someone would need to query the data and encode it to JSON/etc. on the fly. There may not be a way to do this without having one of the programmers directly connect to the DB and run custom queries. The data is likely spread across several tables, with the bibliographies in a separate, large, text-only section.

To dump everything blindly would be irresponsible. It's possible the DB could lock up or that incorrect/inappropriate data could be queried. They should do a few test queries to determine the size of the data, then break it up into chunks to pull a little at a time and QC/stitch it together.
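For what it's worth, the "pull a little at a time" approach might look roughly like the sketch below. It pages through a table by primary key and writes one JSON object per line, so no single query has to hold the whole catalog. It uses an in-memory sqlite database purely as a stand-in; the `records` table, its columns, and the chunk size are assumptions, and a real library system would have its own schema and almost certainly a different database engine.

```python
import io
import json
import sqlite3


def export_in_chunks(conn, out, chunk_size=1000):
    """Write all rows of `records` to `out` as JSON Lines, chunk_size rows per query."""
    last_id = 0
    total = 0
    while True:
        # Keyset pagination: resume after the last id seen, so each query stays small.
        rows = conn.execute(
            "SELECT id, title FROM records WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            break
        for rec_id, title in rows:
            out.write(json.dumps({"id": rec_id, "title": title}) + "\n")
        last_id = rows[-1][0]
        total += len(rows)
    return total


# Stand-in data: 2,500 fake records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO records (title) VALUES (?)",
    [(f"Title {i}",) for i in range(2500)],
)

buf = io.StringIO()
count = export_in_chunks(conn, buf, chunk_size=1000)
print(count, "records exported")
```

The QC step would then be comparing `count` against a `SELECT COUNT(*)` and spot-checking a few lines. None of this removes the need for someone with database access and the time to do it.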

Their software is either homebrew or off the shelf. If it's off the shelf, they may need to put in a support ticket and have their vendor figure it out. It's highly unlikely there is a government employee or team sitting around and waiting for your request.

It's not that your concerns are invalid, but I believe that the library has a point about not performing this amount of work on an individual request. Have like-minded individuals request the changes you want to the library system. The library is there to serve the public at large, but large projects like this need high level buy in, funding, and hands.

Someone would need to query the data and encode it to JSON/etc. on the fly. There may not be a way to do this without having one of the programmers directly connect to the DB and run custom queries.

I use sqlite. They are probably using some heavier duty db but for sqlite exporting JSON is trivial so I would be surprised if other DBs did not have a similar mechanism. And to be clear, I said to the library that I prefer JSON but would handle whatever open format they prefer, be it XML or CSV.
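To show what I mean by trivial, here is a minimal sketch using Python's built-in `sqlite3` and `json` modules. The `catalog` table and its columns are invented for the example; I am not claiming the library's system looks anything like this.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows become addressable by column name

# Hypothetical table standing in for a real catalog.
conn.execute("CREATE TABLE catalog (id INTEGER PRIMARY KEY, title TEXT, author TEXT)")
conn.executemany(
    "INSERT INTO catalog (title, author) VALUES (?, ?)",
    [
        ("On the Origin of Species", "Charles Darwin"),
        ("Silent Spring", "Rachel Carson"),
    ],
)

rows = conn.execute("SELECT id, title, author FROM catalog ORDER BY id").fetchall()

# Each sqlite3.Row converts to a dict, and the list of dicts serialises directly.
exported = json.dumps([dict(row) for row in rows], ensure_ascii=False, indent=2)
print(exported)
```

Swapping `json.dumps` for the `csv` module would give CSV with about the same amount of effort.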

To dump everything blindly would be irresponsible. It’s possible the DB could lock up or that incorrect/inappropriate data could be queried.

This does not sound like a realistic problem. I might imagine if they had a DB of all ISBNs, they would obviously have to use a query that limits to their catalog. Apart from that, I don’t see what would be inappropriate. If it’s in their catalog, why hide it? Not sure what you have in mind but I should say it’s not the US where there would be some right wing concern to prevent children from getting sex education type of material, or the Christian right trying to make Darwin’s theories hard to reach.

If you are thinking in terms of sensitive info, like accounts of people and what they borrow, it would be irresponsible if that kind of info were not in a separate table.

Their software is either homebrew or off the shelf. If it’s off the shelf, they may need to put in a support ticket and have their vendor figure it out. It’s highly unlikely there is a government employee or team sitting around and waiting for your request.

I was expecting my request to be ignored, as open data requests often are -- and rarely fulfilled in my experience even when they answer. But in the case at hand, they first responded favorably, saying essentially: we can give you some data but your request is vague. What exactly do you want? I basically replied with “everything”. So they were not opposed to exporting some data, but the volume involved (100s of 1000s of records) seems to be a show-stopper.

I believe that the library has a point about not performing this amount of work on an individual request.

I might agree that it’s a bit much to serve one person. They also said it would take disproportionate resources when they have a whole public to serve. But I was figuring “build it, and they will come”. There would have to be a first person to make a request. I am a bit disappointed that, even if it were made available, we could not expect many people to use it to escape the UI’s limitations.

It is simple in sqlite (which is purpose-built to be simple and small), so you assume all other databases are equally simple. You then expect library staff to be standing by ready to help with your demands.

Well prepare to be shocked: That expectation is absurdly naive and self-centered. YTA

It is simple in sqlite (which is purpose-built to be simple and small), so you assume all other databases are equally simple.

It’s the other way around. I expect a simple DB to be more basic. A more complex DB should be even more featured. If, for example, an Oracle DB cannot easily handle the job that a small and simple home kit can, Oracle should be embarrassed.

and self-centered.

Yikes! What leads you to think this is about me? It’s about databases. I did not invent sqlite. It was an example. You can fuck off with your vitriol.

Is my request unreasonable?

Yes, it is unreasonable. As you have already been told. You asked, but didn't like the answer. Again, you have only a rudimentary understanding of the problem but base everything off your experience and your needs.

You asked, but didn't like the answer.

I was looking for good answers. Convincing answers. Which I expected to correspond with data volume.

In any case, good answers are defensible, should the occasion arise. When you cannot defend your answers, it indicates a lack of justified confidence despite an expectation that others adopt some kind of blind confidence in your answers.

The thread has those aplenty, it is just that you are confidently incorrect so you don't like the answers.

I'd rather guess it's related to them having no one to actually do the work. it's likely they bought a commercial OPAC (if that's still used), paid someone to import their legacy data, and now update the system as part of their daily business. if their system does not offer any export functionality right off the bat, you are out of luck.

what about scraping (slowly, if you want to be nice) via their web interface?
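a rough sketch of what "slowly" could mean: one request at a time with a fixed pause in between. everything specific here is made up: `catalog.example.org` is a placeholder host, the `q` parameter is a guess at how a search URL might look, and the 5 second delay is arbitrary. check the site's terms and robots.txt before pointing anything like this at a real catalog.

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_page(query):
    """Fetch one search-result page. The URL layout is hypothetical."""
    url = "https://catalog.example.org/search?" + urlencode({"q": query})
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def scrape_slowly(queries, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Run fetch(query) for each query, pausing delay_seconds between requests."""
    results = {}
    for i, query in enumerate(queries):
        if i > 0:
            sleep(delay_seconds)  # be nice: never hit the server back to back
        results[query] = fetch(query)
    return results


# Dry run with a fake fetcher and a fake sleep, so nothing touches the network.
pauses = []
pages = scrape_slowly(
    ["dune", "maus"],
    fetch=lambda q: f"<html>results for {q}</html>",
    delay_seconds=5.0,
    sleep=pauses.append,
)
print(pages, pauses)
```

for real use you would pass `fetch=fetch_page` and leave `sleep` at its default. parsing the returned HTML is the hard part and depends entirely on the site.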

My wife is a librarian and I can assure you, she doesn't stand around all day waiting for people to come in and ask questions. She can barely finish her tasks in the time provided. Libraries are busy places and many things are happening that the public doesn't see.

I'm sorry, i did not want to imply anything like that.

i was just speculating that most libraries are users of opac like systems and not developers of these systems.

Sorry I meant to add to your comment as further reasons the request would be hard.

I might try scraping. It’s not my 1st choice but it might be a viable plan B.

Guess I will mention that the library search page blocks Tor and my machine does not work on clearnet, which was yet another motivating factor in my request. But I can always work around that on a one-off basis.. to either scrape or grab a whole DB.

They likely don’t have any technologist on staff capable of direct db exports or any kind of scripting. So they complete these requests manually, which would be excessively burdensome.