This project involves exploring datasets and thinking about spreadsheet design. The dataset used in this project comes from data.gov and concerns Graffiti removal in Chicago. Upon finding this dataset, I downloaded the CSV file.
Opening this CSV file in OpenRefine, I took a look at the data and made some cursory observations about it. The column headings include things like:
- Date the request was made to have graffiti removed,
- Status of the job,
- Date the job was completed,
- Request number,
- Type of surface the graffiti is on
- Where the graffiti is
- Street address
- X-Y coordinates
- Latitude and Longitude
- Ward
- Police District
- Community Area
Naturally, the where the graffiti is and on what type of surface is useful information for those tasked with removing/covering it up, as is the street address. The Police District, Ward, and Community Area information would also make it easier to find. Having X-Y Coordinates and Latitude and Longitude seems overkill, but perhaps that is useful information for a city as big as Chicago. Having three columns, one for latitude, one for longitude, and one for both, seemed unnecessary,
One column heading that definitely seemed redundant to this dataset is “type of service requested,” since the same info was entered for each entry, namely “graffiti removal.” This was confirmed by running a text facet on this column, which returned only the one choice.
Another curious choice was revealed by running another text facet on the column “where is the Graffiti located?” While this returned information about the most and least common places to be graffitied, it also turned up the detail that one of the requests was for an expressway with the note “DSS will NOT remove.”
This information was not included in the “status” field where I might have expected it, in fact this specific request is marked completed. Such is the information that can gathered when thinking about spreadsheet design, what information should be collected and how and where should it be entered.