Written Report 6.419x Module 2  
Name: Ruihao Zhang  
Problem 2  
(1-1). (3 points) Provide at least one visualization which clearly shows the existence of three main brain cell types as described by the scientist, and explain how it shows this. Your visualization should support the idea that cells from different groups can differ greatly.
Solution: Referring to the previous section, I first converted the .npy file to a CSV file and log-transformed the data. I then ran PCA and projected the data onto the first 15 principal components, followed by t-SNE dimensionality reduction with the perplexity set to 50 (a perplexity of 40 gave a similar picture). I looked for 3 main groups because three cell types are indicated in the header.
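Below is a minimal sketch of this pipeline in Python. The path "p2_unsupervised/X.npy" is an assumption (substitute the actual file), and the sketch loads the .npy directly rather than going through a CSV.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Load the count matrix (path is an assumption) and log-transform it
X = np.load("p2_unsupervised/X.npy")
X_log = np.log2(X + 1)

# Project onto the first 15 principal components before t-SNE
X_pca = PCA(n_components=15).fit_transform(X_log)

# Nonlinear 2-D embedding for visualization
X_tsne = TSNE(n_components=2, perplexity=50, random_state=0).fit_transform(X_pca)

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], s=5)
plt.title("t-SNE of the first 15 PCs")
plt.show()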
[Figure: t-SNE visualization after PCA; legend: Class 1, Class 2, Class 3]
As the figure shows, three clusters indicated by different colors are clearly visible, which is consistent with the three main brain cell types described by the scientist: the data for the three cell types can be clearly distinguished.
(1-2). (4 points) Provide at least one visualization which supports the claim that within each of the three types, there are numerous possible sub-types for a cell. In your visualization, highlight which of the three main types these sub-types belong to. Again, explain how your visualization supports the claim.
Solution: To illustrate the presence of sub-types within the main groups, I reused the K-means clustering procedure from the previous sections. Based on the earlier results, I set the number of sub-clusters to 8 and ran K-means, assigning a different color to each cluster while keeping the overall t-SNE layout unchanged. Because the layout is the same as before, each sub-cluster can be matched to the main type whose region it falls in; a sketch is given below.
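A minimal sketch of this step, assuming X_pca and X_tsne from the sketch in part (1-1):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Cluster in PCA space into 8 candidate sub-types
labels = KMeans(n_clusters=8, random_state=0, n_init=10).fit_predict(X_pca)

# Color the unchanged t-SNE layout by sub-cluster, so the three large
# regions (main types) keep their shape while sub-types become visible
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap="tab10", s=5)
plt.title("K-means sub-clusters (k = 8) on the t-SNE layout")
plt.show()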
[Figure: t-SNE visualization colored by 8 K-means sub-clusters; main regions: Class 1, Class 2, Class 3]
As the figure shows, the three main groups are still visible as distinct regions, consistent with the previous division, while the eight colors reveal the sub-clusters that make up each of the three main groups.
(2-1). (4 points) Using your clustering method(s) of choice, find a suitable clustering for the cells. Briefly explain how you chose the number of clusters by appropriate visualizations and/or numerical findings. (Cluster cells into the subcategories instead of the categories.)
Solution: To determine a suitable total number of clusters, I used the elbow method together with the silhouette score to select the number of K-means clusters. As shown below, K = 7 is the most appropriate choice.
I then plotted the t-SNE embedding after PCA projection as before, with different colors indicating the different sub-clusters. It can be seen that with 7 sub-clusters the groups are clearly distinguishable.
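A minimal sketch of the elbow/silhouette selection described above, assuming X_pca from the earlier sketch:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

ks = range(2, 13)
inertias, sils = [], []
for k in ks:
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X_pca)
    inertias.append(km.inertia_)                      # elbow criterion
    sils.append(silhouette_score(X_pca, km.labels_))  # silhouette criterion

# Plot both criteria side by side and look for the elbow / the peak
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(list(ks), inertias, "o-")
ax1.set_xlabel("k"); ax1.set_ylabel("inertia")
ax2.plot(list(ks), sils, "o-")
ax2.set_xlabel("k"); ax2.set_ylabel("silhouette score")
plt.show()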
(2-2). (6 points) We will now treat your cluster assignments as labels for supervised learning. Fit a logistic regression model to the original data (not principal components), with your clustering as the target labels. Since the data is high-dimensional, make sure to regularize your model using your choice of l1, l2, or elastic net, and separate the data into training and validation or use cross-validation to select your model. Report your choice of regularization parameter and validation performance.
Solution: I used the previous cluster assignments as labels for supervised learning and fit a logistic regression model to the original data (not the principal components), with the clusters as target labels. The steps, sketched in code below, were:
Feature selection: select the 100 best features using SelectKBest.
Data normalization: standardize the features using StandardScaler.
Logistic regression modeling: use LogisticRegressionCV for the multi-class model, with l2 regularization.
Cross-validation: perform cross-validation with cross_val_score and output the per-fold scores and their mean.
Plot cross-validation scores: plot the cross-validation score for each fold.
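A minimal sketch of these steps, assuming X_log is the log-transformed original data and labels holds the K = 7 cluster assignments from part (2-1); using f_classif as the SelectKBest scoring function is an assumption:

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score

# Keep the 100 most discriminative original features, then standardize
X_sel = SelectKBest(f_classif, k=100).fit_transform(X_log, labels)
X_std = StandardScaler().fit_transform(X_sel)

# l2-regularized multi-class logistic regression; the regularization
# strength C is chosen by internal cross-validation over a default grid
clf = LogisticRegressionCV(penalty="l2", cv=5, max_iter=1000)
clf.fit(X_std, labels)
print("Optimal regularization parameters (per class):", clf.C_)

# Report validation performance via 5-fold cross-validation
scores = cross_val_score(clf, X_std, labels, cv=5)
print("Cross-validation scores:", scores, "mean:", scores.mean())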
[Figure: cross-validation score for each fold]
Optimal regularization parameters: [5.99484250e-03, 1.29154967e+03, 3.59381366e-01, 3.59381366e-01, 2.78255940e+00, 3.59381366e-01, 2.78255940e+00]
Cross-validation scores: [0.9654, 0.9977, 0.9931, 0.9747, 0.8868]
Mean cross-validation score: 0.9635  
3. (9 points) We will now treat your cluster assignments as labels for supervised learning. Fit a logistic  
regression model to the original data (not principal components), with your clustering as the target labels.  
Since the data is high-dimensional, make sure to regularize your model using your choice of l1, l2, or elastic  
net, and separate the data into training and validation or use cross-validation to select your model. Report  
your choice of regularization parameter and validation performance.  
Solution: For this task I chose the p2_evaluation_reduced dataset, due to computation limitations, and performed the following steps (sketched in code below):
Data loading: load the training and test data from the specified paths.
Logarithmic transformation: apply the transformation log2(x + 1) to the data.
Feature selection, comparing two methods: (a) 100 features chosen uniformly at random; (b) the 100 features with the largest variance.
Standardization: standardize the selected features.
Model training and evaluation: train logistic regression models and evaluate the performance of the two feature selection methods.
Plot histograms: compare the variance distribution of the randomly selected features with that of the high-variance features.
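A minimal sketch of the comparison; the file names under p2_evaluation_reduced/ are assumptions and should be replaced with the actual paths:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Load and log-transform the data (paths are assumptions)
X_tr = np.log2(np.load("p2_evaluation_reduced/X_train.npy") + 1)
y_tr = np.load("p2_evaluation_reduced/y_train.npy")
X_te = np.log2(np.load("p2_evaluation_reduced/X_test.npy") + 1)
y_te = np.load("p2_evaluation_reduced/y_test.npy")

# Method (a): 100 features chosen uniformly at random
rng = np.random.default_rng(0)
rand_idx = rng.choice(X_tr.shape[1], 100, replace=False)
# Method (b): the 100 features with the largest training variance
var_idx = np.argsort(X_tr.var(axis=0))[-100:]

for name, idx in [("random features", rand_idx), ("high variance features", var_idx)]:
    scaler = StandardScaler().fit(X_tr[:, idx])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(scaler.transform(X_tr[:, idx]), y_tr)
    pred = clf.predict(scaler.transform(X_te[:, idx]))
    print(name,
          "accuracy:", accuracy_score(y_te, pred),
          "F1:", f1_score(y_te, pred, average="weighted"))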
As shown in the figure below, the distribution of the 100 randomly selected features differs from that of the top 100 features with the highest variance. The variances of the highest-variance features are concentrated, mainly around 15, while the variances of the randomly selected features are much more dispersed.
The test results for the two feature selection methods are shown below: selecting the top 100 features by variance achieves a clear advantage on every metric, demonstrating the effectiveness of this method.
Metric      random features   high variance features
Accuracy    0.5099            0.9034
AUC         0.9559            0.9983
Precision   0.6450            0.9091
Recall      0.5099            0.9034
F1 Score    0.5393            0.9018
Problem 3  
1. (3 points) When we created the T-SNE plot in Problem 1, we ran T-SNE on the top 50 PC's of the data. But we could have easily chosen a different number of PC's to represent the data. Run T-SNE using 10, 50, 100, 250, and 500 PC's, and plot the resulting visualization for each. What do you observe as you increase the number of PC's used?
Solution:  
Run T-SNE using 10 PC's:  
Run T-SNE using 50 PC's:  
Run T-SNE using 100 PC's:  
Run T-SNE using 250 PC's:  
Run T-SNE using 500 PC's:  
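The five plots above were produced by varying only the number of PC's; a minimal sketch, assuming X_log from Problem 2:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Sweep the number of principal components fed into t-SNE
for n_pcs in [10, 50, 100, 250, 500]:
    X_pca = PCA(n_components=n_pcs).fit_transform(X_log)
    emb = TSNE(n_components=2, perplexity=50, random_state=0).fit_transform(X_pca)
    plt.figure()
    plt.scatter(emb[:, 0], emb[:, 1], s=5)
    plt.title(f"t-SNE on top {n_pcs} PCs")
plt.show()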
During the experiment I adjusted only the number of PC's; all other parameters are the same as in Problem 1. Comparing the results above, as the number of PC's increases, the points within each cluster of the T-SNE embedding become sparser and the apparent number of clusters changes. Specifically, with 10 and 50 PC's the number of clusters is 5, while with 100 and 250 PC's it is in general 3 (two clusters in the 100 PC's plot appear internally divisible into two classes, and one cluster in the 250 PC's plot appears internally divisible into two classes), and the points within the clusters are sparser than before. When the number of PC's reaches 500, the number of clusters is 4 (one cluster contains very few points).
2. (13 points) Pick three hyper-parameters below (3 is the total number that a report needs to analyze: either (a) 2 from A and 1 from B, or (b) 1 from A and 2 from B) and analyze how changing the hyper-parameters affects the conclusions that can be drawn from the data. Please choose at least one hyper-parameter from each of the two categories (visualization and clustering/feature selection). At minimum, evaluate the hyper-parameters individually, but you may also evaluate how joint changes in the hyper-parameters affect the results. You may use any of the datasets we have given you in this project. For visualization hyper-parameters, you may find it productive to augment your analysis with experiments on synthetic data, though we request that you use real data in at least one demonstration.
Solution: The hyper-parameters I selected are, from Category A, the T-SNE perplexity and the T-SNE learning rate, and, from Category B, the effect of the number of PC's chosen on clustering. The data are the same as in Problem 1.
I varied the number of PC's independently, and for each value I varied the T-SNE perplexity and learning rate jointly. This yields 25 plots per PC setting, which I arranged in a 5×5 grid for easy comparison; a sketch of the sweep is given below. Perplexity values: 10, 20, 30, 40, 50. Learning rate values: 100, 500, 1500, 2000, 2500. Number of PC's: 10, 50, 100, 250, 500.
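A minimal sketch producing one such 5×5 grid (here for 50 PC's; the other grids repeat this with a different n_pcs), assuming X_log from Problem 2:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

perplexities = [10, 20, 30, 40, 50]
learning_rates = [100, 500, 1500, 2000, 2500]
n_pcs = 50  # repeat for 10, 100, 250, 500

X_pca = PCA(n_components=n_pcs).fit_transform(X_log)

# One subplot per (perplexity, learning rate) combination
fig, axes = plt.subplots(5, 5, figsize=(20, 20))
for i, perp in enumerate(perplexities):
    for j, lr in enumerate(learning_rates):
        emb = TSNE(n_components=2, perplexity=perp, learning_rate=lr,
                   random_state=0).fit_transform(X_pca)
        axes[i, j].scatter(emb[:, 0], emb[:, 1], s=2)
        axes[i, j].set_title(f"perp={perp}, lr={lr}", fontsize=8)
plt.show()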
Number of PC's: 10
Number of PC's: 50
Number of PC's: 100
Number of PC's: 250
Number of PC's: 500
Conclusion: Individually, as the number of PC's increases, the number of clusters generally decreases and the data points within each class become sparser. As the T-SNE perplexity increases, the number of clusters generally decreases and the points within each class become denser. As the T-SNE learning rate increases, the quality of the embedding generally decreases and many isolated points appear.
Jointly, when the T-SNE perplexity and learning rate increase at the same time, the number of clusters decreases, and a small number of isolated points remain even when both reach the largest values in the grid. Setting both too large severely degrades the T-SNE result. In summary, the learning rate is effective at roughly 100-1500, the perplexity at roughly 20-50, and the number of PC's at roughly 10-250.