OpenRefine, Stephanie Falkowski

Posted 4 years ago by sfalkowski

This project involves exploring datasets and thinking about spreadsheet design. The dataset used in this project comes from data.gov and concerns Graffiti removal in Chicago. Upon finding this dataset, I downloaded the CSV file.

The download CSV file button on the data.gove website

Opening this CSV file in OpenRefine, I took a look at the data and made some cursory observations about it. The column headings include things like:

Date the request was made to have graffiti removed,
Status of the job,
Date the job was completed,
Request number,
Type of surface the graffiti is on
Where the graffiti is
Street address
X-Y coordinates
Latitude and Longitude
Ward
Police District
Community Area

Naturally, the where the graffiti is and on what type of surface is useful information for those tasked with removing/covering it up, as is the street address. The Police District, Ward, and Community Area information would also make it easier to find. Having X-Y Coordinates and Latitude and Longitude seems overkill, but perhaps that is useful information for a city as big as Chicago. Having three columns, one for latitude, one for longitude, and one for both, seemed unnecessary,

chart showing Latitude, Longitude, and both together — latitude and longitude information split,
but next to a column listing the coordinates together

One column heading that definitely seemed redundant to this dataset is “type of service requested,” since the same info was entered for each entry, namely “graffiti removal.” This was confirmed by running a text facet on this column, which returned only the one choice.

Facet showing graffiti removal as the only option for the type of service request column — “Graffiti Removal” – the only choice for
“Type of Service Request” in this Spreadsheet

Another curious choice was revealed by running another text facet on the column “where is the Graffiti located?” While this returned information about the most and least common places to be graffitied, it also turned up the detail that one of the requests was for an expressway with the note “DSS will NOT remove.”

Facet on "Where" returns unexpected result — Facet on “Where is the Graffiti located?” returns an unexpected result – Is this the best place for the information that DSS will not remove the Graffiti?

Request marked "Will not remove" listed as "Complete" — Request with the “Will NOT remove” note marked as “Completed”
The “Status” column may have been the more obvious place to put the “will not remove” info.

This information was not included in the “status” field where I might have expected it, in fact this specific request is marked completed. Such is the information that can gathered when thinking about spreadsheet design, what information should be collected and how and where should it be entered.

More on OpenRefine

Posted 4 years ago by sfalkowski

For my last post on using OpenRefine, I worked with a dataset of the British Library’s comic book holdings found on thomaspadilla.org.

go to thomaspadilla.org

This site provides guidance for some of the ways in which OpenRefine can be used to clean data. It has walk-throughs of using features including the text filter, facet tool, clustering, and transform on the data there provided. For my post, I used some of these same techniques to find answers to other questions. I questioned how many of the records listed had a less-than-certain place of publication recorded. I did not take into account for the purposes of this exercise those records where the place of publication was unknown to the point of leaving the field blank – of which, there are 2050 – but only those that offered a place followed with a qualifying question mark. There were only 233 that fit the description, and with OpenRefine I was also able to tell that most of them, 183, were given the designation of “London?” followed distantly by the broader placeholder of “England?” that was used in only fourteen instances, and others used less than four times each.

chart showing facet tool sorting most common questioned places of publication

OpenRefine offers many ways of manipulating messy, inconsistent data to render it usable to answer a variety of questions. It was not necessary to clean the data completely, but only those aspects determined by the question being asked, which in this case meant manipulation of the Place of Publication column, though plenty of further cleaning could be done in other equally inconsistent columns to open it to even further questions.

OpenRefine with British Library Data

Posted 4 years ago by sfalkowski

Using OpenRefine

A certain data set on the British Library’s comic book holdings has a number of issues that makes using this data to answer specific questions rather difficult. In this post, I will use OpenRefine to clean and sort the data such that it can answer a specific question, in this case, “How many of the British Library’s comic books have a questioned place of publication listed?”

As above noted, this data has numerous issues, including:

Authors: some authors names are given last, first; some have a period after the name; some a comma after the name; sometimes the same name is entered in different formats
Place of Publications: some give only city names, others city and state, others just a country; state names are abbreviated in various ways; some publication places are bracketed; some have multiple entries.
Publisher: the same publisher is not always entered the same way, e.g. Titan, Titan Books, Titan [distributor], et.al. all refer to the same publisher
Date of Publication: some dates end with a period, some do not, some are bracketed, some are circa dates
Some entries do not have information for all the fields

The basic problem is in the inconsistent ways information is entered in the spreadsheet.
Not all of these problems need to be resolved to answer the question regarding comics whose place of publication is uncertain, but some of the inconsistencies are best eliminated.

Many of the place of publication entries are bracketed, and if we want to look deeper into the question, we probably want “London?” and “[London?]” counted together.
To do this, I used the the transform tool and told it to remove the brackets in this column. The previews show “[London?]” being transformed to just “London?”

Our question can be answered by using the filter text tool in the column for Place of Publication and entering a question mark in the search field, since glancing through the data, this seems to be the way uncertainty is denoted. The text filter limits the view to only those comics that have a question mark in their place of publication. However, looking further into the matter using the facet tool, we see that there are 28 ways in which a question mark appears. Many of them denote the uncertainty in question, others seem to question the spelling of the place’s name.

At this point, it is simple enough to specify which of the twenty-eight are relevant and select them to be included. This gives us 233 records for comics whose place of publication is guessed at. And we can at a glance see other information such as London being by far the most common assumed place of publication, with 183 comics out of the 233 or nearly 80%.