18th November

In this blog post, I continue my exploration of the Analyze Boston website, a comprehensive data hub covering a wide range of topics, and focus on its environmental datasets. Within this category, I identified nine datasets, including intriguing ones such as “Clough House Archaeology,” “Blue Bike System Data,” “Building Energy Reporting,” and “Rainfall Data.” Notably, the “Rainfall Data” dataset provides monthly rainfall measurements in inches, while, interestingly, the “Clough House Archaeology” dataset appears to contain no data at all.

Delving into the Blue Bike System Data, I examined the time-stamped trip records, discovering information on riders’ birth years, gender, and other relevant details. This dataset seems promising for time series analysis, opening up opportunities to glean insights into temporal trends and patterns.
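The trip timestamps lend themselves to exactly that kind of analysis. Below is a minimal sketch, with made-up timestamps standing in for the dataset’s trip-start column (the column name `starttime` is an assumption on my part), of how monthly ride counts could be derived as a starting point:

```python
import pandas as pd

# Hypothetical trip-start timestamps standing in for the Blue Bike
# "starttime" column; the real values would come from the dataset
timestamps = pd.to_datetime([
    "2023-01-05 08:12", "2023-01-20 17:45", "2023-02-03 09:30",
    "2023-02-14 18:05", "2023-02-28 07:55", "2023-03-10 12:40",
])
trips = pd.DataFrame({"starttime": timestamps})

# Count trips per month ("MS" = month start) -- the kind of temporal
# aggregation a time series analysis of the ride data would begin with
monthly = trips.set_index("starttime").resample("MS").size()
print(monthly.tolist())  # [2, 3, 1]
```

Resampling turns raw timestamps into an evenly spaced monthly series, which is the form most trend and seasonality tools expect.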

16th November

In Week 10, as part of my exploration of the Analyze Boston Data Hub, I am selecting data for analysis, with a focus on aligning it with my desired outcomes. Although I initially considered an economic analysis, I ran into a limitation: too few data points in that domain. Today, I turned my attention to the “Approved Building Permits” dataset, which records permits issued to ensure that construction projects adhere to the State Building Code’s safety requirements, prioritizing health and safety.

The dataset I ultimately chose revolves around food inspections in Boston, recording how businesses that serve food comply with the relevant sanitary rules and standards. These inspections occur at least once a year, with follow-up inspections for high-risk facilities. By analyzing this data, I intend to gain insight into how the Boston area’s food scene operates, asking how the big brands perform at scale and whether they fare similarly to other food businesses. My goal is to use this data to forecast trends in the food industry, and I plan to experiment with visualization techniques, exploring libraries beyond Matplotlib and Altair to enhance the presentation of my findings.
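As a first experiment in that direction, here is a small sketch, using toy rows and assumed column names (`businessname`, `result` are my guesses, not the dataset’s confirmed schema), of how inspection pass rates could be compared across establishments:

```python
import pandas as pd

# Toy rows mimicking the food-inspection dataset's shape; the real file
# and its actual column names would come from Analyze Boston
inspections = pd.DataFrame({
    "businessname": ["Chain A", "Chain A", "Local Diner", "Chain B", "Local Diner"],
    "result": ["Pass", "Fail", "Pass", "Pass", "Fail"],
})

# Pass rate per establishment -- one way to compare big brands
# against smaller operations
pass_rate = (
    inspections.assign(passed=inspections["result"].eq("Pass"))
    .groupby("businessname")["passed"]
    .mean()
)
print(pass_rate.to_dict())  # {'Chain A': 0.5, 'Chain B': 1.0, 'Local Diner': 0.5}
```

The same groupby pattern scales directly to the full dataset once the real columns are substituted in.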

13th November

In Week 10, I explored the “Analyze Boston” Open Data Hub, which provides comprehensive data from the Boston Planning & Development Agency (BPDA) spanning January 2013 to December 2019. This platform serves as a repository for statistics, maps, and information relevant to daily life in the city. Its primary goal is to enhance the accessibility of public information through a standardized technological platform.

The data hub covers a wide array of topics, including Geospatial, City Services, Finance, Environment, Economy, Public Safety, and more, totaling around 246 subjects. The website offers recommendations for popular datasets and highlights newly added or modified datasets.

As I browse through the available topics, I am contemplating where to focus my efforts. I am leaning towards the Economy category, aiming to analyze trends, make informed decisions, and communicate insights. Specifically, I am considering areas such as gross domestic product (GDP), inflation rates, consumer spending, and stock market performance.

In my upcoming blog, I plan to narrow down my topic within the Economy category and begin the analysis, with the aim of contributing to improvements in that field.

8th November

In my blog thus far, I have executed K-means clustering on the first dataset. In this post, I employ the heat-map technique to discern correlations among variables such as age, body camera usage, and signs of mental illness.

To implement the heat-map effectively, I used the correlation method to establish the relationships between these variables. Heat-maps are a common tool in data visualization, providing a visually pleasing and comprehensible way to represent complex datasets. They prove especially valuable for visualizing intricate relationships or patterns within large data matrices or tables. Within the heat-map, each cell represents the correlation coefficient between two variables, with the color intensity indicating the strength of the connection and the color itself its direction (positive or negative).
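To make the idea concrete, here is a minimal sketch of such a correlation heat-map, using a handful of made-up rows in place of the real shooting dataset (the column names are stand-ins for the fields discussed above):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Toy stand-ins for the fields discussed above; in the real analysis
# these would be the (encoded) dataset columns
df = pd.DataFrame({
    "age": [23, 35, 47, 29, 52, 61],
    "body_camera": [0, 1, 0, 1, 1, 0],             # 1 = camera on
    "signs_of_mental_illness": [1, 0, 1, 0, 0, 1], # 1 = signs present
})

corr = df.corr()  # pairwise Pearson correlation coefficients

# Each cell's color encodes the strength and sign of one coefficient
fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)), corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im, ax=ax)
fig.savefig("heatmap.png", bbox_inches="tight")
```

Fixing the color scale to the interval [-1, 1] keeps the hue interpretable: the diagonal is always a perfect +1, and off-diagonal cells fade toward white as the relationship weakens.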

5th November

In today’s blog, I conducted K-means clustering on both of the provided datasets. During the process, however, I encountered an obstacle with the first dataset, which involves variables such as gender, race, city, state, and police department. The objective was to establish connections between these factors and individual encounters.

While working on this task, I focused on the variables of age and gender to better understand the occurrence of fatalities among males and females. The analysis revealed a notably higher incidence of fatalities among males than among females. To make the correlation analysis with age possible, I applied a technique known as label encoding to convert gender from a string variable into a numeric one, allowing a more meaningful evaluation in relation to age.
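A small sketch of that encoding step, on toy rows rather than the actual dataset (the values and counts here are invented for illustration):

```python
import pandas as pd

# Toy rows standing in for the dataset's gender and age fields
df = pd.DataFrame({
    "gender": ["M", "F", "M", "M", "F", "M"],
    "age": [34, 27, 41, 19, 55, 38],
})

# Label-encode gender so it can enter a numeric correlation with age
df["gender_encoded"] = df["gender"].map({"F": 0.0, "M": 1.0})

# Fatality counts by gender -- the kind of imbalance noted above
counts = df["gender"].value_counts()
print(counts["M"], counts["F"])

# Pearson correlation between the encoded gender and age
print(df["gender_encoded"].corr(df["age"]))
```

With only two categories, this simple mapping is equivalent to scikit-learn’s `LabelEncoder`; for variables with many unordered categories, one-hot encoding is usually the safer choice, since label codes would impose a spurious ordering.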

1st November

In today’s blog, I will primarily focus on topics related to interpreting data in the context of police shootings across various states in the United States. This interpretation is based on the examination of variables within the datasets using statistical methods such as T-Test, Analysis of Variance (ANOVA), and Bayes’ Theorem.

T-Test: The t-test is a statistical hypothesis test commonly used in data analysis to compare the means of two groups and determine whether there is a significant difference between them. It serves as a valuable tool to assess the likelihood that observed differences between two groups are either genuine or occurred by chance. One of the most prevalent types of t-tests is the independent samples t-test, and the analysis in this context involves formulating the Null Hypothesis (H0) and Alternative Hypothesis (H1).
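A minimal illustration of the independent samples t-test with SciPy, using two made-up groups of ages in place of the dataset’s real groups; H0 says the two group means are equal:

```python
from scipy import stats

# Hypothetical ages for two groups (e.g. encounters with vs. without
# a body camera); the real groups would come from the dataset
group_a = [34, 27, 41, 19, 55, 38]
group_b = [52, 61, 47, 58, 44, 63]

# Independent samples t-test against H0: equal means
t_stat, p_value = stats.ttest_ind(group_a, group_b)

if p_value < 0.05:
    print("Reject H0: the group means differ significantly")
else:
    print("Fail to reject H0")
```

With these invented numbers the second group is clearly older on average, so the test rejects H0 at the 5% level; real data, of course, decides for itself.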

ANOVA: Analysis of Variance, or ANOVA, is a statistical technique that goes beyond the t-test by comparing the means of three or more groups to identify statistically significant differences among them. It extends the applicability of the t-test, which primarily deals with comparing means between two groups, to scenarios involving multiple groups. ANOVA proves to be a particularly valuable tool when assessing whether multiple treatment groups or factors exhibit significant differences from one another.
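The same idea extended to the three-or-more-groups case, again with invented ages standing in for groups drawn from, say, the state column:

```python
from scipy import stats

# Hypothetical ages across three states -- the multi-group scenario
# that ANOVA handles and the t-test does not
state_a = [25, 31, 28, 35, 30]
state_b = [44, 39, 47, 41, 45]
state_c = [33, 29, 36, 31, 34]

# One-way ANOVA against H0: all group means are equal
f_stat, p_value = stats.f_oneway(state_a, state_b, state_c)
print(f_stat, p_value)
```

A significant result here only says that at least one group mean differs; a post-hoc test (e.g. Tukey’s HSD) would be needed to say which one.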

Bayes’ Theorem: Bayes’ Theorem, named after the 18th-century mathematician and theologian Thomas Bayes, is a fundamental concept in probability theory and statistics. It is widely employed in data analysis, machine learning, and numerous other domains to update and revise the probability of a hypothesis or event based on new data or information. Bayes’ Theorem provides a framework for making probabilistic inferences and drawing conclusions under conditions of uncertainty.
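A small worked example with invented probabilities, just to show the mechanics of the update P(H|D) = P(D|H) · P(H) / P(D):

```python
# Made-up numbers purely to illustrate the update rule
p_h = 0.01             # prior: P(hypothesis)
p_d_given_h = 0.90     # likelihood: P(data | hypothesis)
p_d_given_not_h = 0.05 # P(data | not hypothesis)

# Law of total probability: P(D) = P(D|H)P(H) + P(D|~H)P(~H)
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

# Bayes' Theorem: posterior = likelihood * prior / evidence
posterior = p_d_given_h * p_h / p_d
print(round(posterior, 3))  # 0.154
```

Even with a strong likelihood, the low prior keeps the posterior modest, which is exactly the kind of reasoning-under-uncertainty the theorem formalizes.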