"Statistics. The science that says if I ate a whole chicken and you didn't eat any, then each of us ate half a chicken." Dino “Pitigrilli” Segrè
Step 1: Understanding the data
When we first started analyzing the summary data, several elements made me wonder how the information was grouped. Traffic stops were down from five years prior, but not dramatically, and the narrative did not draw attention to the change. Arrests were up and cite-and-release cases were down, but taken together there was no major difference. There was a 3-to-1 ratio of charges to arrests. Most significant was the inclusion of 2020, the year of COVID, in the averages.
2020 was an anomaly in the data. How an anomaly is treated matters because it can affect statistical calculations such as an average. There are some accepted practices: remove it from consideration, run the calculation both ways and show the difference so the audience can decide, or, at a minimum, include a footnote.
That first pass did not match the narrative, which prompted further investigation. The source for the published information was the police RMS (records management system). This source contains significant PII (personally identifiable information), and redacting sensitive information can be laborious and time-consuming. The other source was the dispatch system, which had less information but many of the elements needed for an initial analysis.
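As a rough illustration of the "run the calculation both ways" option, here is a minimal Python sketch. The yearly counts and the function name `average_with_and_without` are made up for this example and are not the published figures; the point is only how much a single unusual year can move an average.

```python
from statistics import mean

# Hypothetical yearly traffic-stop counts, for illustration only
# (these are NOT the figures from the published report).
stops_by_year = {2018: 1000, 2019: 980, 2020: 400, 2021: 940, 2022: 960}


def average_with_and_without(values_by_year, anomaly_year):
    """Return (mean over all years, mean excluding anomaly_year)."""
    with_anomaly = mean(values_by_year.values())
    without_anomaly = mean(
        value for year, value in values_by_year.items() if year != anomaly_year
    )
    return with_anomaly, without_anomaly


with_2020, without_2020 = average_with_and_without(stops_by_year, 2020)
print(f"Average including 2020: {with_2020:.1f}")
print(f"Average excluding 2020: {without_2020:.1f}")
```

With these invented numbers, the five-year average is 856 with 2020 included and 970 without it. Showing both figures side by side lets the audience judge how much the anomaly matters.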
We requested two years of dispatch data: 2022 and 2023. This balanced having enough data to analyze and compare against the burden of gathering it.
From the dispatch system, we now had the following, plus a few more data points:
- Incident number
- Date & Time
- Cross Street
- Zip code
- Activity category
- Police Department responding
Matching up the information was straightforward:
- "Incident number" equaled "Calls for Service." It was clarified that either a citizen or a police officer could generate an incident, and there was no way to break this down. We initially thought that distinction might become vital as we progressed, but it turned out not to be relevant and could be roughly correlated with the activity category.
- The activity category was consistent with the RMS.
- Timestamps allowed correlation with publicly noticeable incident reports for deeper dives into specific activities.
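The matching described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DispatchRecord` field names, the sample rows, the helper functions, and the 30-minute window are all hypothetical, chosen to mirror the elements listed earlier rather than the actual dispatch export.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DispatchRecord:
    # Field names are assumptions that mirror the elements listed above.
    incident_number: str
    timestamp: datetime
    cross_street: str
    zip_code: str
    activity_category: str
    department: str


# Made-up sample records, for illustration only.
records = [
    DispatchRecord("22-0001", datetime(2022, 3, 1, 14, 5), "Main St", "00001", "Traffic stop", "PD A"),
    DispatchRecord("22-0002", datetime(2022, 3, 1, 14, 40), "Oak Ave", "00001", "Welfare check", "PD A"),
    DispatchRecord("22-0003", datetime(2022, 7, 9, 22, 15), "Main St", "00002", "Traffic stop", "PD B"),
    DispatchRecord("23-0001", datetime(2023, 1, 4, 8, 30), "Elm St", "00001", "Traffic stop", "PD A"),
]


def counts_by_year_and_category(rows):
    """Count incidents per (year, activity category) pair."""
    return Counter((r.timestamp.year, r.activity_category) for r in rows)


def incidents_near(rows, when, window=timedelta(minutes=30)):
    """Return incident numbers whose timestamp falls within `window` of `when`."""
    return [r.incident_number for r in rows if abs(r.timestamp - when) <= window]


counts = counts_by_year_and_category(records)
nearby = incidents_near(records, datetime(2022, 3, 1, 14, 20))
print(counts)
print(nearby)
```

Counting by year and category gives a quick way to compare 2022 against 2023, and the time-window lookup is one simple way to line up a dispatch record with an incident report when only an approximate time is known.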
Having a common data definition was critical to establishing the credibility of the analysis. "Calls for service" were first thought to be just citizens calling 911, but once we learned that officers could also initiate a "call for service," the overall information made more sense.
We now had over 5,000 records to dive into, and immediately, deeper insights began to emerge. To understand those insights, we needed more data.
Step 1: Understanding the data
Step 2: Building the data set
- What is not clear
- What information can be tied together to improve that clarity
Step 3: Quality control
Step 4: Presentation of insight
Step 5: Lessons learned.
By: Patrick Grant, Director of Public Sector Sales