What if cops don't bother to put the right data in?

I'm working on some analysis on the article that came out in the Town Crier about Los Altos Hills hiring private security, but in the course of doing that analysis, I came across this interesting situation. So it goes in data analysis.

Anyway, this turns out to be interesting. Los Altos Hills doesn't have its own police department; for now, it contracts for police services from the Santa Clara County Sheriff's Department. The SCC Sheriff's Department has incident data on CityProtect, which anyone can download. The trouble is, the incident data only includes the "block-sized address", and not the city where the incident occurred.

It's easy enough to see the "200 Block CAROL DR", put that into Google Maps, and see where it is. Turns out it's San Jose. But with around 32,000 incidents in the first 9 months of 2023, we need an automated way to do that.

In a future post I'll nerd out about how to do that, but for now, just know that there are internet servers that will let you do this automatically. I wrote some code to use them, and they often worked -- but not always. I couldn't figure out why it was working sometimes, but not others. 

A friendly developer on Reddit made some suggestions and noted that there were a lot of duplicates in the data. Now, that's interesting, and a fork in the road. The left fork digs into why duplicates might cause my code to fail sometimes but not always. We'll save that for another day, and cover it with nerdy warning labels.

But the right fork says, "Wait a minute. Duplicate addresses for incidents? I mean, maybe you have a handful of incidents if the same store gets broken into, or maybe there is a park where cops bust a lot of kids for smoking dope or drinking. But there shouldn't be a LOT of duplicates."

Fortunately, it's easy to find out. Here's a table of the most frequent entries for the address where incidents occurred in the first 9 months of 2023.

Address                             Incidents
1 Block BLOCK S ABEL ST             672
1 Block                             640
1 Block BLOCK W HEDDING ST          627
.                                   253
STEVENS CREEK BLVD                  247
1 Block BLOCK S BASCOM AV           246
200 Block BLOCK STEVENS CREEK BLVD  214
1 Block BLOCK N 1ST ST              197
1 Block BLOCK S DE ANZA BLVD        186
100 Block BLOCK W YOUNGER AV        171
MONTEREY HWY                        170
2700 Block CAROL DR                 157
SARATOGA AV                         142
FY 101                              141
1 Block BLOCK ENBORG CT             122
20700 Block STEVENS CREEK BLVD      112
E SANTA CLARA ST                    107
1 Block BLOCK TASMAN DR             100
100 Block BLOCK FRUITVALE AV        100
N 1ST ST                            100
1 Block BLOCK S CAPITOL AV          97
100 Block BLOCK UNIVERSITY AV       92
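For anyone who wants to reproduce a tally like the one above, here is a minimal sketch using only the Python standard library. It is an illustration, not necessarily the code behind the table: the column name "address" and the handful of sample rows are assumptions standing in for the real CityProtect export, whose headers may differ.

```python
import csv
import io
from collections import Counter

# Toy stand-in for a CityProtect export. The header names and the rows
# are made up for illustration; only the address strings echo the post.
SAMPLE_CSV = """\
address,incident_type
1 Block BLOCK S ABEL ST,DISTURBANCE
1 Block,"PHONE UR OFFICE, OR:"
1 Block BLOCK S ABEL ST,TRESPASSING
200 Block CAROL DR,TAKE A REPORT
1 Block,"PHONE UR OFFICE, OR:"
1 Block BLOCK S ABEL ST,NARCOTICS
"""


def top_addresses(csv_file, column="address", n=20):
    """Return the n most common values of `column` as (value, count) pairs."""
    reader = csv.DictReader(csv_file)
    # Strip whitespace so "1 Block " and "1 Block" count as the same entry.
    counts = Counter(row[column].strip() for row in reader)
    return counts.most_common(n)


for address, count in top_addresses(io.StringIO(SAMPLE_CSV), n=3):
    print(f"{address!r}: {count}")
```

With a real download, you would pass an open file instead of the `io.StringIO` wrapper and set `column` to whatever the export actually calls its address field.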

Well, that was unexpected, wasn't it? What's going on there?

Start with the second entry: "1 Block". It's basically blank. If you go into the data, almost all of these entries have Incident Type "PHONE UR OFFICE, OR:". That's a little cryptic, but it looks like a phone call, or a request for one, so it doesn't really matter where the officer was when they received it. A blank location seems plausible there.

But what about that first entry? 672 incidents in the same block? What is that block, a war zone? Street view shows Abel Plaza, a nice looking strip mall with some restaurants. This doesn't look like a hotbed of crime.

"Abel" is an interesting street name, though. It's hard to find a list of all the streets in Santa Clara County in alphabetical order, but this one starts with Aberdeen. Abel comes before Aberdeen. What if some officers are just picking the first street alphabetically when entering their reports, rather than actually using the real location of the incident?

Well, what is in the 1 block (i.e., addresses from 1 to 99) of W Hedding Street? The office of the District Attorney. Further down the list we see 1st Street and Younger; that's the corner where the Sheriff's Office is. Then, there's ".". That sure looks like someone didn't bother putting in the data.

Still, maybe there is a reasonable explanation. Let's look at the incident types for all of the incidents at Hedding Street. 

Incident type             Incidents
TAKE A REPORT             156
ASSAULT AND BATTERY       126
DISTURBANCE               101
MISDEMEANOR WANT          63
MALICIOUS MISCHIEF        53
DISTURBANCE, FIGHT        18
NARCOTICS                 15
PHONE UR OFFICE, OR:      15
ASSAULT                   11
SUSPICIOUS CIRCUMSTANCES  11
BATTERY                   6
CRIMINAL THREATS          6
FELONY WANT               6
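A breakdown like the one above is just a filter followed by a count. Here is a rough sketch of that step, assuming the data has already been loaded as (address, incident type) pairs. The rows are toy data, and the function name `incident_types_at` is made up for this example.

```python
from collections import Counter

# Toy (address, incident_type) pairs; not real incident records.
ROWS = [
    ("1 Block BLOCK W HEDDING ST", "TAKE A REPORT"),
    ("1 Block BLOCK W HEDDING ST", "DISTURBANCE"),
    ("1 Block BLOCK W HEDDING ST", "TAKE A REPORT"),
    ("1 Block BLOCK N 1ST ST", "TRESPASSING"),
    ("1 Block BLOCK N 1ST ST", "DISTURBANCE"),
]


def incident_types_at(rows, street_fragment):
    """Count incident types for rows whose address contains street_fragment."""
    fragment = street_fragment.upper()
    return Counter(
        incident_type
        for address, incident_type in rows
        if fragment in address.upper()
    )


for incident_type, count in incident_types_at(ROWS, "W HEDDING ST").most_common():
    print(incident_type, count)
```

Matching on a substring is a deliberate simplification. It would also pick up any other block of the same street, so for a real analysis you may want to compare the full address string instead.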

It makes sense that taking a report might happen in an office. I would have thought those happened at the Sheriff's Office, not the DA's office, but I don't know how their procedures work. So let's give them the benefit of the doubt on 156 of the incidents. But the rest? That's a lot of assaults and disturbances in the DA's office.

How about "1 Block BLOCK N 1ST ST," which contains the Sheriff's office?

Incident type             Incidents
DISTURBANCE               66
TRESPASSING               14
INDECENT EXPOSURE         8
NARCOTICS                 7
DRUNK IN PUBLIC           7
SUSPICIOUS CIRCUMSTANCES  6
TRAFFIC HAZARD            6
TAKE A REPORT             6

I mean, maybe there are a lot of disturbances in the police station, with people upset about getting arrested and such. But 14 incidents of trespassing? And 6 traffic hazards?

Here's the code I used if you want to play with the data yourself.

It seems pretty likely that there are a lot of incident entries being made with bogus location data. If I had to guess, there are a few officers that regularly just don't bother to enter the data correctly. The officer ID is not made public, but an internal search would be easy: Just use the CCNs from the incidents with likely bogus locations, and find the set of officers. If it's a computer problem, it will probably be spread across all the officers. If some officers are just not doing their job, there will be only a few that pop out of that list.
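To make that internal check concrete, here is a hedged sketch of what it might look like. Everything in it is hypothetical: the officer IDs, the CCN values, and the CCN-to-officer lookup are invented, since that mapping is not public. The idea is simply to count the flagged incidents per officer and see what share the top few account for.

```python
from collections import Counter


def officer_concentration(flagged_ccns, ccn_to_officer, top_k=3):
    """Return (per-officer counts, share of flagged incidents from the top_k officers).

    flagged_ccns: CCNs of incidents with likely bogus locations.
    ccn_to_officer: hypothetical internal lookup from CCN to officer ID.
    """
    counts = Counter(
        ccn_to_officer[ccn] for ccn in flagged_ccns if ccn in ccn_to_officer
    )
    total = sum(counts.values())
    if total == 0:
        return counts, 0.0
    top_total = sum(count for _, count in counts.most_common(top_k))
    return counts, top_total / total


# Invented example: 10 flagged incidents entered by 3 officers.
lookup = {f"C{i}": "officer_a" for i in range(6)}
lookup.update({f"C{i}": "officer_b" for i in range(6, 9)})
lookup["C9"] = "officer_c"
flagged = [f"C{i}" for i in range(10)]

counts, share = officer_concentration(flagged, lookup, top_k=2)
print(counts.most_common(), f"top-2 share: {share:.0%}")
```

A share near 100% for a handful of officers would point toward individual habits, while a share close to what you would expect from each officer's overall caseload would point toward a system problem. A fair comparison would need each officer's total incident count as a baseline, which this sketch leaves out.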

Bottom line

It sure looks like there are many incidents for which bogus location data has been entered. That's bad because it means the data is polluted, but it's also bad because it means officers are ignoring one of their job responsibilities. Whether or not they think it's important to get the data right, their oversight bodies are requiring it. They need to get it right.

This isn't the only report of mis-entered data. This article in the SF Standard found that cops were deliberately misreporting data. The state RIPA Board heard about a DOJ analysis that found the "officer shots fired" database doesn't match up very well at all with the RIPA data. Any time an officer fires their gun, it should show up in both, but they found that fewer than 20% of the events actually did.

Misreporting data is a self-fulfilling prophecy. Cops put bad data in the database, then claim that we can't depend on any data analysis because the data is bad. How can we have accountability, if those being held accountable can undermine the process?
