How to improve NYC’s open data law 2.0?

Last week, we joined members of the NYC Transparency Working Group to provide feedback on the future of NYC’s open data law (video). As this legislation is under discussion, we encourage everyone to share their thoughts via Councilmatic.

Intro 1528-2017
This bill would require updates to the agency compliance plan, to include the names of public datasets provided in response to Freedom of Information Law requests when such datasets were not included on the Open Data Portal.

Intro 1707-2017
This bill would extend the time agencies have to complete their open data compliance plans and publish data on the open data portal; it would codify agencies’ existing practice of designating an employee to be the agency’s open data coordinator; and it would require the Department of Information Technology and Telecommunications to collect, analyze, and publish site analytics of the open data portal.

In general, BetaNYC is happy with both bills and very supportive of efforts from the Administration and Council to keep NYC’s open data program #1 in the nation. You can hop into this google doc and provide comments on our testimony (PDF – Testimony to NYCC – Intro 1528-2017 2017.09.20 Open Data 2.0 Bill).

Summary of BetaNYC’s testimony:

Support 1707-2017

We have concerns about extending the deadline to 2021.
We have concerns around the broad user/usage reporting language. The language needs to provide transparency around use while protecting people’s privacy.
We encourage the City to adopt permissive copyright around data sets and data products. We highly recommend the adoption of Creative Commons Zero and the GNU General Public License v. 3.0.
We call for a public right of action to ensure data accessibility across Administrations.
We call for clarity around dataset compliance and the creation of a simpler interface for seeing which datasets are in compliance with existing open data reporting laws and which datasets have geocoded elements.

Support 1528-2017

We are very interested in knowing which datasets are derived from the City’s unified FOIL process.

Additional statement added as oral testimony:

For the last few years, we’ve discovered a number of data issues related to geocoding. Geocoding is the process of translating a point into an address or attributes, or vice versa. Geocoding is how one turns an address into a specific location or finds out the community board or council district of a point.

New York City is unique in providing a free municipal geocoder accessible to municipal agencies and the general public. We both use NYC Geosupport produced by NYC Planning. Frustratingly, we continue to come across a number of issues that are not easily addressed. We call for the Administration and Council to provide adequate resources to turn this mainframe tool into a modern, open source tool that will enable the public to point out problems and collaborate to fix them. We call for transparency around the data, code, and process. For New Yorkers to know where they are going, they must know where they are.
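For readers unfamiliar with geocoding, here is a minimal Python sketch of the two directions described above: turning an address into a point, and finding the district that contains a point. It is an illustration only and does not reflect how NYC Geosupport works internally. The address, coordinates, district names, and boundaries are all made up, and the function names (`geocode`, `point_in_polygon`, `district_of`) are hypothetical.

```python
from typing import Optional

Point = tuple[float, float]  # (longitude, latitude)

# Hypothetical address table; a real geocoder draws on far richer reference data.
ADDRESS_POINTS: dict[str, Point] = {
    "1 EXAMPLE PLAZA": (-74.0060, 40.7128),
}

# Hypothetical district boundaries, simplified to rectangles for illustration.
DISTRICTS: dict[str, list[Point]] = {
    "District A": [(-74.02, 40.70), (-74.00, 40.70), (-74.00, 40.72), (-74.02, 40.72)],
    "District B": [(-74.00, 40.70), (-73.98, 40.70), (-73.98, 40.72), (-74.00, 40.72)],
}


def geocode(address: str) -> Optional[Point]:
    """Address -> point: normalize case and spacing, then look the address up."""
    normalized = " ".join(address.upper().split())
    return ADDRESS_POINTS.get(normalized)


def point_in_polygon(point: Point, polygon: list[Point]) -> bool:
    """Ray-casting test: count how many polygon edges lie to the right of the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def district_of(point: Point) -> Optional[str]:
    """Point -> attribute: return the first district whose boundary contains the point."""
    for name, boundary in DISTRICTS.items():
        if point_in_polygon(point, boundary):
            return name
    return None


location = geocode("1 example  plaza")
if location is not None:
    print(location, district_of(location))
```

Even in this toy version, the kinds of issues we run into are visible: an address that is missing from the reference table returns nothing, and a boundary drawn slightly wrong silently assigns a point to the wrong district. Open data and open code make such problems possible for the public to find and report.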

Additionally, we want to point out testimony from Sumana Harihareswara:

I’m Sumana Harihareswara, and I am not here representing anyone other than myself. I’m a programmer, an open source software expert, and a New Yorker who cares a lot about open data. I appreciate the work you all have done regarding open data in NYC, including these bills.

Regarding Int 1707, I want to say three things.

First, I agree with the representative from the Transparency Committee regarding the 2018 deadline. If it’s possible to keep the 2018 deadline, that would be good to avoid giving the agencies a really long homework extension, as he said.

Second, regarding a licensing provision, I agree with Mr. Webber. When I’m looking at datasets, trying to figure out whether they’re available for me to use, I really appreciate seeing permissive licenses. I’m checking whether it’s legal for me to reuse a dataset, to remix it into an application, to use it in a presentation, and so on, and a license that’s compliant with the Open Definition (opendefinition.org) is an easy marker I can check to make sure. It’s like a way to brand the dataset so I know it’s truly open and reusable.

And third, I agree with what Mr. Hidalgo said about the analytics provision. Right now the wording says: “Such data shall include, but need not be limited to, page views, unique users and the location from which a user accesses such portal.” As Mr. Hidalgo said, “the location” could be construed as meaning individual IP addresses, which in many cases are as personally identifiable as individual street addresses. Many internet security experts are basically now treating IP addresses as PII, Personally Identifiable Information, and recommending that we treat IP addresses with that level of confidentiality and use retention policies to delete those records regularly.

I’d suggest that the language give better parameters for what it means when it says “unique users” and “location”. We probably want the numbers of unique users, but we should not actually be publishing enough to identify them — I would not want to be publicly identified as someone who looked up a dataset about a sensitive topic. For “location” — what level of specificity do we want? New York City versus outside New York City? What borough a user is in? What community board district? That’s the level of specificity I could imagine executives and agencies using to figure out who’s using the data and what kinds of initiatives to encourage and incentivize.

Thank you very much.