Title: BOSTON’S ECONOMIC DIARY: Tracing the Contours of Growth and Challenge (2013-2019)

In “Boston’s Economic Diary,” we embark on a comprehensive journey through the economic heartbeat of Boston from 2013 to 2019. This detailed study utilises a spectrum of data to paint a vivid picture of the city’s economic growth and challenges during this period.

Key Highlights of the Analysis:

  • Sectoral Interplay and Economic Health: An exploration into how varying sectors, like Boston Logan International Flights and Passengers, and Hotel Occupancy, interconnect and influence the city’s economic vitality.
  • Airport Dynamics and Job Market Trends: A deep dive into the increasing trend of Logan Airport passengers and how this surge correlates with employment growth in the city.
  • Hotel Industry’s Economic Indicators: An analysis of Hotel Average Daily Rates and Occupancy, revealing their relationship with Boston’s broader economic trends.
  • Predictive Economic Modeling: Advanced techniques, including neural network modelling, shed light on future economic trends based on key indicators like airport traffic and hotel industry data.

For a comprehensive understanding of our insights and their broader impact, you can access the full report.

Project_Report#3

 

 

8th December

Working on time series analysis of the economic indicators dataset, I explored the patterns in the Hotel Average Daily Rate variable. In my analysis, I computed the Autocorrelation Function (ACF) and the Partial Autocorrelation Function (PACF).

Autocorrelation Function (ACF):

  • The ACF measures the correlation between a time series and its lagged values, or earlier data.
  • It shows the general dependence structure of the time series and computes the correlation coefficients for various lags.
  • Finding the order of a moving average (MA) process can be done with the help of the ACF: in an MA(q) process, the autocorrelations cut off sharply after lag q, rather than decaying slowly as they do for autoregressive processes.

While working on the ACF to build a pre-model, I produced the plot for the Hotel Average Daily Rate. The number of lags to include in the plot is determined by the ‘lags’ parameter, which you can change to suit your needs, and the width of the confidence interval surrounding the autocorrelation values is determined by the ‘alpha’ parameter.
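Below is a minimal sketch of how such a plot can be produced with statsmodels; the file name and the column name hotel_avg_daily_rate are assumptions for illustration, not necessarily the exact names in my notebook.

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# Load the economic indicators data (path and column name are assumptions)
econ = pd.read_csv("economic-indicators.csv")
hotel_rate = econ["hotel_avg_daily_rate"].dropna()

# 'lags' controls how many lags are shown; 'alpha' sets the confidence-band width
plot_acf(hotel_rate, lags=24, alpha=0.05, title="ACF - Hotel Average Daily Rate")
plt.show()
```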

 

Partial Autocorrelation Function (PACF):

  • In contrast, the PACF eliminates the impact of the intermediate lags from the measurement of the correlation between a time series and its lagged values.
  • By removing the influence of the intermediate lags, it aids in determining the direct relationship between observations at various lags.
  • When determining the order of an autoregressive (AR) process, the PACF is especially helpful. For lags greater than the process order, the partial autocorrelations in an AR process drop to zero.

As with the ACF, only the plotting function needs to change; I computed the Partial Autocorrelation Function (PACF) for both variables.
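A sketch of the corresponding PACF plot is shown here, under the same assumed file and column name as the ACF sketch above.

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_pacf

# Same assumed file and column name as in the ACF sketch
econ = pd.read_csv("economic-indicators.csv")
hotel_rate = econ["hotel_avg_daily_rate"].dropna()

# 'ywm' (bias-adjusted Yule-Walker) is a common estimation method for the PACF
plot_pacf(hotel_rate, lags=24, alpha=0.05, method="ywm",
          title="PACF - Hotel Average Daily Rate")
plt.show()
```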

The graphs below show the resulting ACF and PACF plots:

 

 

5th December

In this blog, I’ll provide an overview of time series analysis, covering its objectives, the analysis process, and its limitations.

**What is Time Series Analysis?**
Time series analysis involves examining data points gathered regularly over a defined period, offering insights into how variables change over time.

**Objectives of Time Series Analysis:**
The goals include understanding the dynamics of time series variables, gaining insights into changing dataset features over time, and facilitating the prediction of future values.

**How to Analyze Time Series:**
1. Gather and clean data.
2. Visualize time versus key features.
3. Assess series stationarity.
4. Create charts to understand characteristics.
5. Utilize ARMA, ARIMA, MA, and AR models (a brief sketch of steps 3 and 5 follows this list).
6. Conclude forecasts.
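Here is a minimal sketch of steps 3 and 5 for a single series, assuming the same illustrative indicators file and column as before; the ARIMA order (1, 1, 1) is only a placeholder that the ACF/PACF plots would refine.

```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA

# Assumed univariate series (file and column name are placeholders)
y = pd.read_csv("economic-indicators.csv")["hotel_avg_daily_rate"].dropna()

# Step 3: Augmented Dickey-Fuller test; a small p-value suggests the series is stationary
adf_stat, p_value, *_ = adfuller(y)
print(f"ADF statistic = {adf_stat:.3f}, p-value = {p_value:.4f}")

# Step 5: fit a simple ARIMA model; (1, 1, 1) is a placeholder order
model = ARIMA(y, order=(1, 1, 1)).fit()
print(model.summary())
```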

**Limitations of Time Series Analysis:**
1. Does not handle missing data.
2. Assumes linear relationships between data points.
3. Data transformations can be resource-intensive.
4. Primarily operates on univariate data.

1st December

In delving deeper into time series analysis, I’ve explored topics such as autocorrelation, forecasting, and cyclical analysis. In considering regression analysis, I’ve concluded that conventional linear regression may not be ideal for time series research, especially when dealing with data exhibiting seasonality, trends, or temporal dependencies. Time series regression is better approached with techniques like autoregressive integrated moving average (ARIMA) models, which account for the inherent characteristics of time series data.

Autocorrelation Function (ACF): A statistical tool, the ACF assesses the correlation between a time series and its own lagged values. It aids in identifying patterns, trends, and seasonality by displaying correlation coefficients for different lags, highlighting significant lags and autocorrelation patterns.

Forecasting: The forecasting process often begins with using the most recent observed value from historical data. Iterative prediction involves projecting subsequent observations based on the assumption that changes or deviations are random and unpredictable.

Evaluation: Performance assessment of the forecasting model is done using metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), or Root Mean Squared Error (RMSE). These metrics compare projected values to actual observations, providing insights into the model’s accuracy.
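As a small sketch of how these metrics can be computed with scikit-learn; the actual and forecast values here are purely illustrative stand-ins for a hold-out period.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Illustrative values only; in practice these come from the validation window
actual = np.array([230.0, 245.0, 260.0, 255.0])
forecast = np.array([228.0, 250.0, 254.0, 262.0])

mae = mean_absolute_error(actual, forecast)
mse = mean_squared_error(actual, forecast)
rmse = np.sqrt(mse)
print(f"MAE = {mae:.2f}, MSE = {mse:.2f}, RMSE = {rmse:.2f}")
```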

18th November

In my blog exploration of the Analyze Boston website, a comprehensive data hub covering various topics, I’ve decided to focus on environmental datasets. Within this category, I’ve identified nine datasets, including intriguing ones such as “Clough House Archaeology,” “Blue Bike System Data,” “Building Energy Reporting,” and “Rainfall Data.” Notably, the “rainfall data” dataset provides monthly measurements of rainfall in inches, while, interestingly, the “Clough House Archaeology” dataset appears to be devoid of any data.

Delving into the Blue Bike System Data, I’ve specifically examined the Time dataset, discovering information related to birth years, gender, and other relevant details. This dataset seems promising for conducting Time Series Analysis, opening up opportunities to glean insights into temporal trends and patterns.

16th November

In Week 10, as part of my exploration of the Analyze Boston Data Hub, I am currently in the process of selecting data for analysis, with a focus on aligning it with my desired outcomes. Despite initially considering an economic analysis, I encountered a limitation due to the scarcity of data points in that domain. Today, I shifted my attention to the “approved building permits” dataset, aiming to ensure that construction projects adhere to the State Building Code’s safety requirements, prioritizing health and safety.

The chosen dataset revolves around food inspections in Boston, providing information on how businesses that serve food comply with relevant sanitary rules and standards. These inspections occur at least once a year, with follow-up inspections for high-risk facilities. By analyzing this data, I intend to gain insights into the functionality of the Boston area’s food chain, questioning the scale of big brands and exploring whether they operate similarly to other food chains. My goal is to utilize this data for forecasting the future of the food industry, and I plan to experiment with visualization techniques, exploring libraries beyond Matplotlib and Altair to enhance the presentation of my findings.

13th November

In Week 10, I explored the “Analyze Boston” Open Data Hub, which provides comprehensive data from the Boston Planning & Development Agency spanning January 2013 to December 2019. This platform serves as a repository for statistics, maps, and information relevant to daily life in the city. The primary goal is to enhance the accessibility of public information through a standardized technological platform.

The data hub covers a wide array of topics, including Geospatial, City Services, Finance, Environment, Economy, Public Safety, and more, totaling around 246 subjects. The website offers recommendations for popular datasets and highlights newly added or modified datasets.

As I browse through the available topics, I am contemplating where to focus my efforts for my analysis. I am leaning towards delving into the economy model, aiming to conduct analyses on trends, make informed decisions, and communicate insights. Specifically, I am considering exploring areas such as gross domestic product (GDP), inflation rates, consumer spending, and stock market performance.

In my upcoming blog, I plan to narrow down my topic within the economy model and initiate the analysis process to contribute to improvements in that particular field.

8th November

In my blog thus far, I have executed K-means clustering for the first dataset. In this particular blog post, I’ve employed the heat-map technique to discern correlations among the variables, such as age, body camera usage, and signs of mental illness.

To implement the heat-map technique effectively, I utilized the correlation method to establish the relationships between these variables. Heat-maps are a common tool in data visualization, providing a visually pleasing and comprehensible way to represent complex datasets. They prove especially valuable for visualizing intricate relationships or patterns within large data matrices or tables. Within the heat-map, each cell represents the correlation coefficient between two variables, with the color showing the direction of the relationship (positive or negative) and the color intensity indicating its strength.
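A minimal sketch of this step follows, assuming the shootings data is in a CSV with columns roughly named age, body_camera, and signs_of_mental_illness that are already numeric or boolean; the file and column names are assumptions.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed file and column names; boolean flags cast cleanly to 0/1 for corr()
df = pd.read_csv("fatal-police-shootings-data.csv")
cols = ["age", "body_camera", "signs_of_mental_illness"]
corr = df[cols].astype(float).corr()

# Each cell holds the correlation coefficient between a pair of variables
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation heat-map")
plt.show()
```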

5th November

In today’s blog, I conducted K-means clustering on both of the provided datasets. However, during the process, I encountered an obstacle with the first dataset, which involved investigating correlations among variables related to gender, race, city, state, and police departments. The objective was to establish connections between these factors and individual encounters.

While working on this task, I specifically focused on the variables of age and gender to better understand the occurrences of fatalities among males and females. The analysis revealed a notably higher incidence of fatalities among males in comparison to females. To facilitate the correlation analysis with age, I employed a technique known as label encoding to convert gender from a string variable to a float variable. This transformation allowed for a more meaningful evaluation in relation to age.
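The label-encoding step can be sketched as below, assuming a gender column holding strings; the file path and column names are placeholders.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("fatal-police-shootings-data.csv")  # assumed path
df = df.dropna(subset=["gender", "age"])

# Convert gender from strings (e.g. 'male'/'female') to numeric codes, then to float
encoder = LabelEncoder()
df["gender_encoded"] = encoder.fit_transform(df["gender"]).astype(float)

# The encoded column can now be correlated with age
print(df[["gender_encoded", "age"]].corr())
```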

1st November

In today’s blog, I will primarily focus on topics related to interpreting data in the context of police shootings across various states in the United States. This interpretation is based on the examination of variables within the datasets using statistical methods such as T-Test, Analysis of Variance (ANOVA), and Bayes’ Theorem.

T-Test: The t-test is a statistical hypothesis test commonly used in data analysis to compare the means of two groups and determine whether there is a significant difference between them. It serves as a valuable tool to assess the likelihood that observed differences between two groups are either genuine or occurred by chance. One of the most prevalent types of t-tests is the independent samples t-test, and the analysis in this context involves formulating the Null Hypothesis (H0) and Alternative Hypothesis (H1).

ANOVA: Analysis of Variance, or ANOVA, is a statistical technique that goes beyond the t-test by comparing the means of three or more groups to identify statistically significant differences among them. It extends the applicability of the t-test, which primarily deals with comparing means between two groups, to scenarios involving multiple groups. ANOVA proves to be a particularly valuable tool when assessing whether multiple treatment groups or factors exhibit significant differences from one another.

Bayes’ Theorem: Bayes’ Theorem, named after the 18th-century mathematician and theologian Thomas Bayes, is a fundamental concept in probability theory and statistics. It is widely employed in data analysis, machine learning, and numerous other domains to update and revise the probability of a hypothesis or event based on new data or information. Bayes’ Theorem provides a framework for making probabilistic inferences and drawing conclusions under conditions of uncertainty.
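As a brief sketch of how the t-test and ANOVA can be run with SciPy: the age samples below are illustrative stand-ins, not values from the dataset; in practice each group would be a filtered slice of the shootings data.

```python
import numpy as np
from scipy import stats

# Illustrative age samples; real groups would come from the dataset
group_a = np.array([23, 31, 27, 35, 29, 41, 38])
group_b = np.array([26, 34, 30, 39, 33, 45, 40])
group_c = np.array([22, 28, 25, 31, 27, 36, 30])

# Independent-samples t-test: H0 says the two group means are equal
t_stat, p_ttest = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_ttest:.4f}")

# One-way ANOVA: H0 says all three group means are equal
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {p_anova:.4f}")
```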

30th October

Thus far in the blog, I’ve discussed the analysis of correlations among variables in the dataset, particularly in relation to incidents of shootings in various states, which are influenced by multiple factors and the age groups of individuals involved.

Subsequently, I delved into the Monte Carlo approximation technique, which allows us to derive approximate numerical results through random sampling, enabling a deeper understanding of the data.

At present, my focus is on employing clustering techniques to unveil hidden insights within the dataset. I’ve implemented the Elbow method in my clustering approach to determine the optimal number of clusters for the analysis. This method involves calculating the inertia, which represents the sum of squared distances between each data point and the cluster center, for a range of cluster sizes and visualizing the results. The “elbow” point, where inertia starts decreasing at a slower rate, indicates the ideal number of clusters.
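A condensed sketch of the elbow computation is shown here; the feature matrix X is a synthetic stand-in, since the real one is built from the shootings data.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in feature matrix; in the real analysis X comes from the dataset
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # sum of squared distances to cluster centers

plt.plot(k_values, inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia")
plt.title("Elbow method")
plt.show()
```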

In the upcoming phases, I will explore clustering techniques further and intend to apply DBSCAN in the analysis to extract more valuable insights.

25th October

During today’s class, I gained insights into various clustering methods.

K-Means, a partitioning technique, arranges data points into K clusters based on their proximity to cluster centers, typically represented by the means of the data points in each cluster. This method aims to minimize the sum of squared distances and is widely applied in practical scenarios.

Conversely, K-Medoids bears similarities to K-Means but designates the medoid, which is the data point positioned most centrally within a cluster, as the cluster’s representative. K-Medoids is particularly useful in situations where resilience to outliers is a critical concern.

DBSCAN, on the other hand, is a density-based approach that identifies clusters as dense regions separated by regions of lower density. This makes it particularly well-suited for datasets featuring irregularly shaped clusters and noise. It relies on two key parameters: an epsilon distance threshold and a minimum number of data points required to define a dense region. DBSCAN is known for its ability to automatically determine the number of clusters, and it exhibits less sensitivity to initial configurations, adapting effectively to diverse data distributions.
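The two key DBSCAN parameters can be sketched with scikit-learn as below; the eps and min_samples values are placeholders, and the two-moons data is only a stand-in for irregularly shaped clusters with noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Stand-in data with irregularly shaped clusters plus noise
X, _ = make_moons(n_samples=400, noise=0.08, random_state=0)
X = StandardScaler().fit_transform(X)

# eps: neighborhood radius; min_samples: points required to form a dense region
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"Clusters found: {n_clusters}, noise points: {n_noise}")
```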

23rd October

The subject of today’s class lecture is centered on K-means clustering, a method used to analyze incidents of shootings that have occurred in various states of the United States. However, it’s important to note that K-means clustering comes with certain constraints. It operates under the assumptions of spherical, equally sized, and equally dense clusters, which may not always align with the complexities of real-world data. Additionally, the algorithm’s performance can be influenced by the initial selection of cluster centroids, potentially leading to suboptimal outcomes.

In the course of my analysis, I will undertake the task of applying K-means clustering, and I will delve deeper into this topic in subsequent discussions.

20th October

As I progress through this blog, I’ve made the assumption that the analysis of Police Shootings, as reported by the Washington Post, primarily revolves around incidents involving both armed and unarmed individuals, with a focus on racial disparities, particularly between Black and White individuals. In the ongoing phase of my project, I am concentrating on dissecting the racial aspects of the shootings that predominantly occur along the eastern coast, a region that constitutes a significant portion of the recorded incidents. However, it’s worth noting that there are certain irregularities or anomalies present, notably in the state of Florida and along the western coast of the country.

During my recent class, I delved into a new topic, namely, the Monte Carlo approximation. This statistical technique is employed in data analysis to estimate numerical results by means of random sampling. Monte Carlo approximation becomes especially valuable when dealing with complex or high-dimensional integrals, assessing probabilities, or making predictions in scenarios where obtaining analytical solutions proves challenging or even impossible.

Random sampling can be incorporated into the analysis by considering the racial aspects, specifically the Black and White populations, and distributing the sampling evenly across these demographics. Additionally, our professor introduced us to the concept of age distribution among individuals killed by the police, and we can calculate the p-value to gain insight into the exposure factor within our analysis.

18th October

Today, I’ve explored the topics that will form the basis of my analysis of the Police shooting data. Additionally, I’ve referred to a PDF file containing geolocations to pinpoint incidents in the United States using latitude and longitude coordinates.

Geopositions, also referred to as geographic coordinates, represent a specific location on the Earth’s surface, often using latitude and longitude values. Python offers numerous modules and tools for working with geo position data, with GeoPy and GeoPandas being two widely used libraries.
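A minimal GeoPandas sketch for loading the incident coordinates into a GeoDataFrame is shown below; the file name and the latitude, longitude, and state column names are assumptions.

```python
import pandas as pd
import geopandas as gpd

# Assumed file and column names
df = pd.read_csv("fatal-police-shootings-data.csv").dropna(subset=["latitude", "longitude"])

# Build point geometries from longitude/latitude pairs (WGS 84 coordinates)
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["longitude"], df["latitude"]),
    crs="EPSG:4326",
)
print(gdf[["state", "geometry"]].head())
```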

One common approach to data analysis involves K-Means clustering, a method that groups data points into K clusters based on their similarity. Several Python libraries are available for performing K-Means clustering, with Scikit-learn being one of the most commonly used ones.

Clustering is a machine learning technique used to group similar data points based on specific attributes or characteristics. In data analysis, clustering helps identify underlying patterns or structures in the data, revealing logical groupings within a dataset. This technique falls under the category of unsupervised machine learning as it leverages inherent similarities and differences in the data rather than relying on labeled data or predefined categories to establish groupings.

In my upcoming blog, I will apply these techniques in my analysis to uncover hidden insights within the dataset.

16th October

In this blog, I primarily discuss the various variables within the given dataset from the Washington Post, with a particular focus on examining their interrelationships. The dataset pertains to fatal police shootings and comprises two distinct sets of data.

The second dataset specifically provides details about law enforcement officers who have been identified in connection with these incidents. It includes variables such as ID, Name, Type (representing the officers’ roles), States, ORI codes, and Total Shootings. While reviewing this data, I observed that the majority of officers listed are affiliated with local police departments, and the highest number of fatal shootings attributed to a single officer is 129.

In a previous blog, I provided an overview of the first dataset, highlighting its 12 variables. My aim is to explore the correlations that may exist among these variables. I also identified key features that may serve as potential points of connection between the two datasets for future analysis.

In my upcoming blog, I will delve deeper into these variables and further explore the correlations that can be drawn between them.

13th October

Today, we delved into the analysis of the disparity in age between White and Black individuals. After aggregating the age data and performing random sampling, we found that the difference in means was quite minimal. However, the resulting p-value from this process was on the order of 10^-78. This strongly suggests that the disparity in means observed in the data is statistically significant.

Furthermore, a similar experiment was conducted with data pertaining to fleeing and not fleeing individuals. Although a similar trend emerged, it appears to be less pronounced. This implies that if a suspect is fleeing, there is a higher likelihood of a shooting incident occurring compared to when they are not fleeing.

In the next steps, I intend to focus on analyzing the data related to the locations of the shootings and the distances to the involved police stations.

Understanding the Dataset (11th October)

In reference to the dataset, it encompasses information about fatal police shootings, and it is organised into two distinct sections. The first section contains details about the victims and the incidents, which can be found in the “fatal-police-shootings-data.csv” file located in the “/v2” directory. The second section comprises data regarding police agencies that have been involved in at least one fatal police shooting since 2015, accessible in the “fatal-police-shootings-agencies.csv” file, also situated in the “/v2” directory. I integrated these two CSV files by using the “agency_ids” value as a reference and eliminated any rows that had missing data (NaN values).

After a discussion with one of my fellow group members, in an effort to gain a more comprehensive understanding of the dataset, I generated a histogram as a visualisation tool. The histogram unveiled that the highest number of incidents involved individuals of the white racial group, followed by Black, Hispanic, Native American, and Asian individuals in descending order. This information provides valuable insights into the distribution of fatal police shootings across various racial groups.
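The merge and the histogram described above can be sketched roughly as follows; the key and column names are assumptions based on the v2 files, and incidents listing several agencies would need additional handling.

```python
import pandas as pd
import matplotlib.pyplot as plt

shootings = pd.read_csv("v2/fatal-police-shootings-data.csv")
agencies = pd.read_csv("v2/fatal-police-shootings-agencies.csv")

# Align key dtypes before joining; rows with several ';'-separated agency ids
# would need to be split first (not shown here)
shootings["agency_ids"] = shootings["agency_ids"].astype(str)
agencies["id"] = agencies["id"].astype(str)

# Join incidents to agencies on the agency id, then drop rows with missing data
merged = shootings.merge(agencies, left_on="agency_ids", right_on="id",
                         how="left", suffixes=("", "_agency")).dropna()

# Bar chart of incident counts by racial group
merged["race"].value_counts().plot(kind="bar")
plt.xlabel("Race")
plt.ylabel("Number of incidents")
plt.show()
```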

Project 2, Day 1

The second project entails examining data sourced from the Washington Post data repository, specifically focusing on fatal police shootings in the United States. For a preliminary assessment of the data and its characteristics, I ran basic commands such as “describe()” and “info()” (a short sketch of these commands follows the stats below). The dataset currently contains 8,770 data points spanning January 2, 2015, to October 7, 2023. I am still working through the dataset and exploring the potential analyses and applications it offers.

Stats:
Total Shootings: 8,770
Time Span: 2015-2023
States: 51
Cities: 3,374
Police Departments: 3,417
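The commands behind the summary above look roughly like this; the column names (date, state, city, agency_ids) are assumptions about the v2 file.

```python
import pandas as pd

df = pd.read_csv("v2/fatal-police-shootings-data.csv")

# First look at the data and its characteristics
print(df.info())
print(df.describe(include="all"))

# Headline counts (column names are assumptions)
print("Total shootings:", len(df))
print("Date range:", df["date"].min(), "to", df["date"].max())
print("States:", df["state"].nunique())
print("Cities:", df["city"].nunique())
print("Agencies referenced:", df["agency_ids"].nunique())
```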

2nd October

Today I discussed and started working on the report for Project 1 with my teammates. We identified five issues to evaluate in the report that affect %diabetes in addition to %obesity and %inactivity. Apart from the report, I’ve started working on the cross-validation part to enhance model accuracy.

I delved into a comprehensive exploration of the Breusch-Pagan test, learning its role in assessing heteroscedasticity within regression models. Heteroscedasticity denotes a scenario where the variance of regression model error terms varies unevenly across different levels of independent variables, thereby challenging a fundamental assumption of linear regression. The Breusch-Pagan test scrutinizes whether a noteworthy correlation exists between the squared residuals of a regression model and its independent variables. When the test reveals a significant relationship, it signals the presence of heteroscedasticity, necessitating adjustments or transformations to rectify this issue in our regression analysis.
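A condensed sketch of the Breusch-Pagan test with statsmodels is given below, assuming a merged county-level frame with columns roughly named pct_diabetes, pct_obesity, and pct_inactivity; the file and column names are placeholders.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Assumed merged county-level data with these columns
df = pd.read_csv("cdc-merged.csv")
X = sm.add_constant(df[["pct_obesity", "pct_inactivity"]])
y = df["pct_diabetes"]

model = sm.OLS(y, X).fit()

# The test relates the squared residuals to the explanatory variables
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(f"LM p-value = {lm_pvalue:.4f} (small values indicate heteroscedasticity)")
```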

28th September

Today I performed Monte Carlo random sampling (1 million iterations): the maximum observed difference in means was 0.7887741935483916, and the distribution of the sampled means is plotted below.
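The resampling loop looks roughly like the sketch below; the two samples here are synthetic stand-ins for the groups being compared, so the numbers it prints will not match the run reported above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative samples; the real analysis uses the two groups from the data
a = rng.normal(7.5, 1.0, size=300)
b = rng.normal(7.3, 1.1, size=300)

observed_diff = a.mean() - b.mean()
pooled = np.concatenate([a, b])

# Monte Carlo resampling of the group labels (reduce n_iter while prototyping)
n_iter = 1_000_000
diffs = np.empty(n_iter)
for i in range(n_iter):
    rng.shuffle(pooled)
    diffs[i] = pooled[: len(a)].mean() - pooled[len(a):].mean()

p_value = np.mean(np.abs(diffs) >= abs(observed_diff))
print(f"observed diff = {observed_diff:.4f}, "
      f"max resampled diff = {diffs.max():.4f}, p = {p_value:.6f}")
```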

Following today’s lecture, I’m going to explore the Urban dataset using a quadratic regression model. To make the model work even better, I’ll start by applying a logarithmic transformation to the data, which may improve the model’s accuracy.
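A minimal sketch of a quadratic fit on log-transformed values follows; x and y here are synthetic placeholders for the two Urban-dataset variables being modeled.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data; in practice x and y come from the Urban dataset
rng = np.random.default_rng(1)
x = np.linspace(1, 10, 200)
y = np.exp(0.05 * x**2 - 0.2 * x + 1) * rng.lognormal(0, 0.1, size=x.size)

# Log-transform the response, then fit a degree-2 polynomial
log_y = np.log(y)
coeffs = np.polyfit(x, log_y, deg=2)
fitted = np.polyval(coeffs, x)

plt.scatter(x, log_y, s=10, alpha=0.5, label="log(y)")
plt.plot(x, fitted, color="red", label="quadratic fit")
plt.legend()
plt.show()
```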

 

 

25th September

On the 25th of September, we covered a resampling method: cross-validation.

Cross-validation is a powerful technique for assessing model accuracy by partitioning the dataset into multiple subsets, or folds. It involves constructing several models, each trained on a different combination of folds while using the remaining fold for validation. This method is invaluable when dealing with limited data, ensuring efficient training and robust model assessment. By randomly resampling and using various subsets of the data during training and validation, we obtain a much more reliable estimate of the model’s predictive performance. While a simple division of the data into two equal groups for training and validation is possible, it lacks the effectiveness of more advanced approaches like k-fold cross-validation, where ‘k’ represents the number of folds. For instance, using a 10-fold cross-validation approach with the CDC diabetes dataset containing 3,100 instances, each fold contains around 310 instances. This rigorous methodology comprehensively evaluates the model’s generalizability and performance.
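A compact sketch of 10-fold cross-validation with scikit-learn is shown below; the feature matrix and target are synthetic stand-ins sized like the roughly 3,100-row CDC dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Stand-in data sized like the CDC dataset; the real X, y come from that data
X, y = make_regression(n_samples=3100, n_features=3, noise=10.0, random_state=0)

# 10-fold CV: each fold of ~310 rows serves once as the validation set
scores = cross_val_score(LinearRegression(), X, y, cv=10, scoring="r2")
print("R^2 per fold:", np.round(scores, 3))
print(f"Mean R^2: {scores.mean():.3f}")
```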

19th September

Today’s session delved into the analysis of crab molting data while introducing the t-test as a tool for evaluating the statistical significance of mean value differences between two distributions. It’s important to note that the t-test may yield unreliable results when applied to non-normally distributed data. To overcome this limitation, we employed Monte Carlo random sampling to determine whether the observed difference in mean values was indeed significant.

The application of these principles was demonstrated using the diabetes dataset. To begin, I integrated three distinct datasets with additional data obtained from the CDC website, including Obesity Data, Food Access (limited access to healthy food), and Urban-Rural (overall SVI), with the aim of uncovering meaningful correlations among these three variables. These datasets were combined into a single data frame using pandas’ merge functionality. Upon observing a disparity in the means of these distributions, I started implementing the techniques covered in our class.

18th September

After a recent discussion with the professor and TA, it has become clear that the connection between the availability of exercise opportunities and the prevalence of physical inactivity is relatively weak, and in some cases, may even be negligible. This revelation has prompted our team to acknowledge the significance of introducing additional variables into our analysis. By doing so, we aim to attain a more comprehensive and nuanced understanding of the intricate dynamics at play within this relationship.

In our recent project team meeting, we engaged in a thorough discussion regarding our project’s ongoing advancements and collectively generated ideas for additional variables that could enhance our understanding of the issue at hand. I have identified two to three potential variables that merit consideration for inclusion in our analysis. At this juncture, my primary focus revolves around investigating the intricate relationships between Obesity Data, Food Access (limited access to healthy food), and Urban-Rural (overall SVI), with the aim of uncovering meaningful correlations among these three variables.

 

13th September

In my data analysis so far, I’ve worked with an Excel file containing three sheets: %diabetes, %inactivity, and %obesity. I began by loading this data into our Jupyter Notebook using the pandas library, allowing us to manipulate and analyze it effectively. I then performed correlation analysis, exploring the relationships between the health indicators in the different sheets. To visualize these correlations, I created a bar plot. Additionally, I applied linear regression between %diabetes and %obesity, and between %diabetes and %inactivity, which helps to understand how one variable might predict the other. Furthermore, I’ve computed and visualized smooth histograms for both %diabetes and %obesity, accompanied by statistics including mean, median, skewness, and kurtosis, providing valuable insights into the distribution and characteristics of these datasets.
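A condensed sketch of that workflow is given below; the Excel file name, the sheet names, and the FIPS join key are assumptions made for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load the three sheets (file, sheet, and column names are assumptions)
sheets = pd.read_excel("cdc-data.xlsx",
                       sheet_name=["%diabetes", "%inactivity", "%obesity"])
df = (
    sheets["%diabetes"]
    .merge(sheets["%inactivity"], on="FIPS")
    .merge(sheets["%obesity"], on="FIPS")
)

# Correlations between the three indicators
print(df[["%diabetes", "%inactivity", "%obesity"]].corr())

# Simple linear regression: %obesity predicting %diabetes
reg = LinearRegression().fit(df[["%obesity"]], df["%diabetes"])
print("slope:", reg.coef_[0], "intercept:", reg.intercept_)
```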

Exploring and analyzing a dataset is just the beginning. Depending on the specific dataset, there are various further analyses and actions I want to perform :

  1. Predictive Modeling: I want to build predictive models using machine learning techniques. For example, in the dataset, I could try to predict %obesity based on %diabetes and %inactivity.
  2. Data Visualization: Explore different types of data visualizations beyond histograms, such as scatter plots and box plots to gain deeper insights into the relationships within the data.
  3. Statistical Tests: Apply various statistical tests to assess the significance of relationships or differences between variables. For instance, t-tests or ANOVA can be used to compare different groups or regions within the same year.

I’ve read a bit about t-tests and ANOVA; these tests are used to compare groups and assess whether there are significant differences between them. They are widely used in hypothesis testing, which we discussed today in the lecture: there are two types of hypotheses, the null hypothesis and the alternative hypothesis. The p-value helps you determine whether there is enough evidence to reject the null hypothesis. If the p-value is smaller than a predetermined significance level (e.g., 0.05), you reject the null hypothesis in favour of the alternative hypothesis.

September 12, 2023

I performed correlations between the given data. I plotted distributions of %diabetic, %obese and %inactive data using the Seaborn library in Python.

I’ve also read about kurtosis, which we discussed during the lecture; it indicates how peaked a distribution is and how much weight lies in its tails.

As I was discussing with my teammates, we found additional data on the CDC website about various counties’ socio-economic status. Apart from this, we’ve also found data related to food access. From this, we can differentiate counties as urban or rural based on socioeconomic status, food access, transportation, population, etc.

I will be trying to solve this problem via linear regression using the scikit-learn library, and will try to incorporate more findings.