Babak Shahian Jahromi My Data Science Blog:

Crime

Introduction

The main goal of this project is to predict how safe a specific location is at a particular time. Crime is a common social problem that affects the quality of life and the economic growth of a society [1]. It is an essential factor in whether people move to a new city and in which places they avoid when they travel [2]. As crime increases, law enforcement agencies continue to demand advanced geographic information systems and new data mining approaches to improve crime analytics and better protect their communities [3]. Although crime can occur anywhere, criminals commonly act on the opportunities they encounter in the areas most familiar to them [4]. A predictive machine learning approach that identifies criminal hotspots, along with the location and time of committed crimes, can raise people’s awareness of which locations are dangerous at which times. This project can therefore help people avoid those locations at certain times of day and potentially save lives. This knowledge can also help people choose where to live. Police forces, in turn, can use it to improve crime prediction and prevention and to allocate resources [6]: it can help distribute officers to the most likely crime locations at any given time, making efficient use of police resources [5]. With all of this information available, we hope to make the community safer for the people who live there and for those who travel there [6].

In this project, I created a web application that takes an address or location in Chicago and a time from the user and predicts the probability of crime (and hence safety) at that location and time. It also searches for crime incidents nearby (within a 100 m radius), marks them on the map, lists them, and saves previous searches.
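
The nearby-incident search can be illustrated with a great-circle distance filter. This is only a minimal sketch: the post does not say how the application computes the 100 m radius, so the haversine formula, the function names, and the sample incidents below are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def crimes_within(incidents, lat, lon, radius_m=100):
    """Return the incidents whose coordinates lie within radius_m of (lat, lon)."""
    return [c for c in incidents
            if haversine_m(lat, lon, c["latitude"], c["longitude"]) <= radius_m]


# Hypothetical incidents, invented for illustration only.
incidents = [
    {"description": "THEFT", "latitude": 41.8782, "longitude": -87.6298},    # about 11 m away
    {"description": "BATTERY", "latitude": 41.8900, "longitude": -87.6298},  # over 1 km north
]
nearby = crimes_within(incidents, 41.8781, -87.6298)
print([c["description"] for c in nearby])
```

Only the first incident falls inside the radius, so it is the only one that would be marked on the map.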

Data Source

The data is the Chicago crime dataset from the City of Chicago Data Portal. For computational efficiency, I used about 1.5 years of data (all of 2017 and five months of 2018), which gives about 350K crime reports. Each report has 22 columns of information, such as the location of the crime, crime and location descriptions, the type and date of the crime, and whether an arrest was made.

Data pre-processing

The data was cleaned and preprocessed before modeling. I dropped duplicates and rows with NA values, parsed the date column (a single date-and-time stamp) and broke it into components (year, month, day, hour, minute), and used a label encoder to convert the arrest labels into numeric classes.
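
The cleaning steps above might look roughly like the following in pandas. This is a sketch under assumptions: the column names ("Date", "Arrest", "Latitude", "Longitude"), the timestamp format, and the four sample rows are hypothetical, and the boolean label is encoded with `astype(int)`, which for a True/False column gives the same 0/1 classes a label encoder would.

```python
import pandas as pd

# Tiny hypothetical sample; column names and timestamp format are assumed.
raw = pd.DataFrame({
    "Date": ["01/15/2018 11:30:00 PM", "01/15/2018 11:30:00 PM",
             "03/02/2017 08:05:00 AM", None],
    "Arrest": [True, True, False, False],
    "Latitude": [41.88, 41.88, 41.75, 41.90],
    "Longitude": [-87.63, -87.63, -87.60, -87.70],
})

# Drop exact duplicate rows, then rows with any missing value.
df = raw.drop_duplicates().dropna().copy()

# Parse the single timestamp column, then split it into components.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y %I:%M:%S %p")
for part in ("year", "month", "day", "hour", "minute"):
    df[part] = getattr(df["Date"].dt, part)

# Encode the boolean arrest label as 0 (no arrest) / 1 (arrest).
df["arrest"] = df["Arrest"].astype(int)

print(df[["year", "month", "day", "hour", "minute", "arrest"]])
```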

Data balancing

Of the 350K crime reports, only about 68K resulted in an arrest (arrest == True); in the rest, no arrest was made. Since the classification target is whether an arrest was made, it is important to balance the data so that each class has a similar count. This was done using a combination of over- and under-sampling. Systematic algorithms for generating synthetic samples are available in the imbalanced-learn library, which works alongside scikit-learn. I used SMOTEENN, which combines the Synthetic Minority Over-sampling Technique (SMOTE) with Edited Nearest Neighbours cleaning.
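
To show the idea behind synthetic over-sampling, here is a simplified NumPy sketch of the SMOTE step only: each new minority sample is interpolated between a minority point and one of its nearest minority neighbours. It is not the code used in the project and it leaves out the nearest-neighbour cleaning step entirely; in practice the `SMOTEENN` class in `imblearn.combine` (with its `fit_resample(X, y)` method) handles both. The function name, parameters, and toy data are assumptions.

```python
import numpy as np


def smote_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)   # random minority points
    pick = rng.integers(0, k, size=n_new)            # which neighbour to use
    nb = neighbours[base, pick]
    gap = rng.random((n_new, 1))                     # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])


# Hypothetical toy data: 20 majority rows (label 0) and 5 minority rows (label 1).
rng = np.random.default_rng(42)
X_major = rng.normal(0.0, 1.0, size=(20, 2))
X_minor = rng.normal(3.0, 0.5, size=(5, 2))

synthetic = smote_oversample(X_minor, n_new=len(X_major) - len(X_minor))
X_bal = np.vstack([X_major, X_minor, synthetic])
y_bal = np.array([0] * len(X_major) + [1] * (len(X_minor) + len(synthetic)))
print(np.bincount(y_bal))   # class counts after balancing
```

Because every synthetic row lies on a segment between two real minority rows, the new samples stay inside the region the minority class already occupies.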

Data train and test split

The data was split into training and testing sets: 80% was assigned to training and the remaining 20% to testing.
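
An 80/20 split of this kind can be sketched with scikit-learn's `train_test_split`. The random feature matrix, the stratification by label, and the fixed random seed are assumptions added for the example; the post does not say whether the original split was stratified or seeded.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix (7 features) and 0/1 arrest labels.
rng = np.random.default_rng(0)
X = rng.random((1000, 7))
y = rng.integers(0, 2, size=1000)

# 80% for training, 20% for testing; stratify keeps class ratios similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)
```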

Feature Selection

The most relevant features for predicting crimes in which an arrest was made (and therefore for predicting safety) are the crime date and time (day, month, year, hour, minute) and the crime location (latitude and longitude). In future work, I would add the ability to classify the crime type in addition to the possibility of crime (safety) at a specific location and time.

Modeling

I ran a grid search, then prediction and evaluation, with several machine learning algorithms: Logistic Regression, Naive Bayes, Support Vector Machines, K-Nearest Neighbors, Decision Trees, Random Forest, and XGBoost. The best parameters from the grid search were used to fit each model, which was then used to predict and to evaluate performance. The models were compared on precision, recall, F1 score, accuracy, and training and testing speed.
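
The grid-search workflow can be sketched with scikit-learn's `GridSearchCV`. Everything specific here is an assumption: the synthetic data from `make_classification`, the two models shown, the parameter grids, the F1 scoring, and the 3-fold cross-validation. XGBoost and the other algorithms are left out to keep the example short and free of extra dependencies.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the crime features and arrest labels.
X, y = make_classification(n_samples=400, n_features=7, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Hypothetical candidate models and parameter grids.
candidates = {
    "logistic_regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "random_forest": (RandomForestClassifier(random_state=0),
                      {"n_estimators": [25, 50], "max_depth": [3, None]}),
}

best = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, scoring="f1", cv=3)
    search.fit(X_train, y_train)          # refits the best parameters on all training data
    best[name] = search
    print(name, search.best_params_)
    print(classification_report(y_test, search.predict(X_test)))
```

Each fitted search keeps its best estimator, so `search.predict` evaluates the tuned model on the held-out test set.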

Comparing the models, I selected XGBoost for this classification task. It gives the best performance metrics across all categories (precision, recall, F1 score, and accuracy), and although it is not the fastest algorithm to train and test, it is not the slowest either.

Getting Started

When you load the application, you have two options. The first is to enter the location and date-time arguments in this format (day, hour, latitude, longitude, minute, month) and click the “is it safe?” button. This returns the probability of safety, that is, the probability that a crime with arrest does not happen. For example, a search for the West Ridge area (far North Side of Chicago) returned a 0.04 chance of safety, or a 96% chance of crime, at that time, date, and location.
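
The input format and the safety figure can be illustrated with a small parser. This is a hypothetical sketch, not the application's code: the `Query` type, the function names, the range checks, and the sample coordinates are all assumptions. It only shows the field order (day, hour, latitude, longitude, minute, month) and that the reported safety is the complement of the predicted crime probability.

```python
from typing import NamedTuple


class Query(NamedTuple):
    """Fields in the order the form expects them."""
    day: int
    hour: int
    latitude: float
    longitude: float
    minute: int
    month: int


def parse_query(text):
    """Parse 'day, hour, latitude, longitude, minute, month' into a Query."""
    parts = [p.strip() for p in text.split(",")]
    if len(parts) != 6:
        raise ValueError("expected 6 comma-separated values")
    day, hour, lat, lon, minute, month = parts
    q = Query(int(day), int(hour), float(lat), float(lon), int(minute), int(month))
    if not (1 <= q.day <= 31 and 0 <= q.hour <= 23
            and 0 <= q.minute <= 59 and 1 <= q.month <= 12):
        raise ValueError("date or time component out of range")
    return q


def safety_from_crime_probability(p_crime):
    """Safety is the complement of the predicted crime probability."""
    return 1.0 - p_crime


# Hypothetical query; the coordinates are illustrative only.
q = parse_query("15, 23, 41.9987, -87.6926, 30, 6")
print(list(q))
print(round(safety_from_crime_probability(0.96), 2))
```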

The second is to enter your desired address in the search box; Google will autocomplete the address. Click the “find the crimes” button. The page centers and zooms the map, marking all crimes within a 100-meter radius. Hover over a marker to see more information about the incident, such as its description, block, and date. You can select ‘Street View’ to look and move around for a better view of the incidents. To see the complete list of results, click the ‘View Results List’ button. The application also saves the last three searches; click the ‘Previous Searches’ button to see them.

Contributing

Please feel free to submit pull requests for development. The following instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

You need the following software for testing and development:

  • Python (version 3)
  • Command line interface
  • Web browser (Microsoft Edge, Firefox, Chrome and Safari supported)
  • Source code editor like Atom or Sublime Text
  • Git source control manager

Cloning Repository

Start by opening the command line and downloading the repository as follows:

git clone https://github.com/BabakShah/DS-SafetyPrediction.git

Change the directory to the project folder and to the desired source files (Python, HTML, CSS, JS):

cd ./DS-SafetyPrediction/

In the command line, install all the Python library dependencies:

pip install -r /path/to/requirements.txt

To launch the web app:

python app.py

Open the following page in a browser:

http://127.0.0.1:5000/

For further development, open the source files in a source code editor. For Python scripts:

open -a "Sublime Text" file-name

For IPython notebooks:

jupyter notebook file-name

Built With

  • Python
  • HTML
  • CSS
  • JS
  • Firebase
  • Flask
  • Git
  • Command Line Interface

APIs

  • Google Maps API: https://developers.google.com/maps/
  • Chicago Crime API: http://api1.chicagopolice.org/clearpath/

Libraries used

  • Scikit-learn
  • Matplotlib
  • Numpy
  • Pandas
  • Jupyter Notebook
  • Moment.js

References

[1] A. Bogomolov, B. Lepri, J. Staiano, N. Oliver, F. Pianesi and A. Pentland, ‘Once Upon a Crime: Towards Crime Prediction from Demographics and Mobile Data’, CoRR, vol. abs/1409.2983, 2014.

[2] R. Arulanandam, B. Savarimuthu and M. Purvis, ‘Extracting Crime Information from Online Newspaper Articles’, in Proceedings of the Second Australasian Web Conference - Volume 155, Auckland, New Zealand, 2014, pp. 31-38.

[3] A. Buczak and C. Gifford, ‘Fuzzy association rule mining for community crime pattern discovery’, in ACM SIGKDD Workshop on Intelligence and Security Informatics, Washington, D.C., 2010, pp. 1-10.

[4] M. Tayebi, F. Richard and G. Uwe, ‘Understanding the Link Between Social and Spatial Distance in the Crime World’, in Proceedings of the 20th International Conference on Advances in Geographic Information Systems (SIGSPATIAL ’12), Redondo Beach, California, 2012, pp. 550-553.

[5] S. Nath, ‘Crime Pattern Detection Using Data Mining’, in Web Intelligence and Intelligent Agent Technology Workshops, 2006. WI-IAT 2006 Workshops. 2006 IEEE/WIC/ACM International Conference on, 2006, pp. 41-44.

[6] T. Almanie, R. Mirza and E. Lor, ‘Crime prediction based on crime types and using spatial and temporal criminal hotspots’, arXiv preprint arXiv:1508.02050, 2015.

Contact me

Babak - email

Prediction of Marathon Time

Introduction

The main goal of this project is to predict marathon race times for male and female runners.

Data Source

The data is sourced from the websites of the six World Marathon Majors: Tokyo, Boston, London, Berlin, Chicago, and New York City. It was scraped from these websites using web scraping tools such as Beautiful Soup and Selenium.

Feature Selection

The most relevant features for predicting marathon times are the athlete’s age, percentage of body fat, average running speed, weather, terrain, and running footwear.

Modeling

I fit separate Ordinary Least Squares models for male and female athletes. I also tried Lasso and Ridge regression with cross-validation and tuning. Lasso regression with alpha=0.0001 was my best model. The adjusted R-squared was 0.44 for the male model and 0.71 for the female model. The final models for both genders are:

Male race time (min) = 285.4 + 2.316 × Body Fat (%) − 10.446 × Running Speed (kph) + 1.38 × Age − 1.02 × Running Shoe

Female race time (min) = 337.4 + 2.154 × Body Fat (%) − 15.112 × Running Speed (kph) + 1.27 × Age − 1.04 × Running Shoe
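
As a worked example, the two fitted equations can be evaluated directly. The sketch assumes that the terms listed under each intercept are added to it, and that "Running Shoe" is a numeric encoding of footwear; the post does not define that encoding, so the value 1 below is purely illustrative, as are the function name and the sample runner.

```python
def predict_race_time(sex, body_fat_pct, speed_kph, age, running_shoe):
    """Predicted marathon time in minutes from the coefficients quoted in the post."""
    # (intercept, body fat, running speed, age, running shoe)
    coefs = {
        "male":   (285.4, 2.316, -10.446, 1.38, -1.02),
        "female": (337.4, 2.154, -15.112, 1.27, -1.04),
    }
    b0, b_fat, b_speed, b_age, b_shoe = coefs[sex]
    return (b0 + b_fat * body_fat_pct + b_speed * speed_kph
            + b_age * age + b_shoe * running_shoe)


# Hypothetical runner: 10% body fat, 12 kph average speed, 30 years old, shoe code 1.
male_time = predict_race_time("male", body_fat_pct=10, speed_kph=12, age=30, running_shoe=1)
female_time = predict_race_time("female", body_fat_pct=10, speed_kph=12, age=30, running_shoe=1)
print(round(male_time, 1), round(female_time, 1))
```

For this runner the male model gives 285.4 + 23.16 − 125.352 + 41.4 − 1.02 ≈ 223.6 minutes, and the female model gives about 214.7 minutes.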

Contributing

Please feel free to submit pull requests for development. The following instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

You need the following software for testing and development:

  • Python (version 3)
  • Command line interface
  • Web browser (Microsoft Edge, Firefox, Chrome and Safari supported)
  • Source code editor like Atom or Sublime Text
  • Git source control manager

Cloning Repository

Start by opening the command line and downloading the repository as follows:

git clone https://github.com/BabakShah/...

Change the directory to the project folder and to the desired source files (Python, HTML, CSS, JS):

cd ./DS-Saf

In the command line, install all the Python library dependencies:

pip install -r /path/to/requirements.txt

For further development, open the source files in a source code editor. For Python scripts:

open -a "Sublime Text" file-name

For IPython notebooks:

jupyter notebook file-name

Built With

  • Python
  • Git
  • Command Line Interface

Libraries used

  • Scikit-learn
  • Matplotlib
  • Numpy
  • Pandas
  • Jupyter Notebook
