Written Report 6.419x Module 2  
Name: Ruihao Zhang  
Problem 2  
(1-1). (3 points) Provide at least one visualization which clearly shows the existence of three main brain cell types as described by the scientist, and explain how it shows this. Your visualization should support the idea that cells from different groups can differ greatly.
Solution: Referring to the previous section, I first converted the .npy file to a CSV file and log-transformed the data. I then ran PCA and projected the data onto the first 15 principal components, followed by t-SNE dimensionality reduction with the perplexity set to 50 (a perplexity of 40 gave a similar picture). I looked for 3 main groups because three cell types are indicated in the header.
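Below is a minimal sketch of this pipeline in Python. The path "p2_unsupervised/X.npy" is an assumption (substitute the actual file), and the sketch loads the .npy directly rather than going through a CSV.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Load the count matrix (path is an assumption) and log-transform it
X = np.load("p2_unsupervised/X.npy")
X_log = np.log2(X + 1)

# Project onto the first 15 principal components before t-SNE
X_pca = PCA(n_components=15).fit_transform(X_log)

# Nonlinear 2-D embedding for visualization
X_tsne = TSNE(n_components=2, perplexity=50, random_state=0).fit_transform(X_pca)

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], s=5)
plt.title("t-SNE of the first 15 PCs")
plt.show()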
[Figure: t-SNE visualization after PCA; legend: Class 1, Class 2, Class 3]
As the figure shows, three clusters indicated by different colors are clearly visible, which is consistent with the three main brain cell types described by the scientist: the data for the three cell types can be clearly distinguished.
(1-2). (4 points) Provide at least one visualization which supports the claim that within each of the three types, there are numerous possible sub-types for a cell. In your visualization, highlight which of the three main types these sub-types belong to. Again, explain how your visualization supports the claim.
Solution: To illustrate the presence of sub-types within the main groups, I reused the K-means clustering procedure from the previous sections. Based on the earlier results, I set the number of sub-clusters to 8 and ran K-means, assigning a different color to each cluster while keeping the overall t-SNE layout unchanged. Because the layout is the same as before, each sub-cluster can be matched to the main type whose region it falls in; a sketch is given below.
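A minimal sketch of this step, assuming X_pca and X_tsne from the sketch in part (1-1):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Cluster in PCA space into 8 candidate sub-types
labels = KMeans(n_clusters=8, random_state=0, n_init=10).fit_predict(X_pca)

# Color the unchanged t-SNE layout by sub-cluster, so the three large
# regions (main types) keep their shape while sub-types become visible
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap="tab10", s=5)
plt.title("K-means sub-clusters (k = 8) on the t-SNE layout")
plt.show()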
[Figure: t-SNE visualization colored by 8 K-means sub-clusters; main regions: Class 1, Class 2, Class 3]
As the figure shows, the three main groups are still visible as distinct regions, consistent with the previous division, while the eight colors reveal the sub-clusters that make up each of the three main groups.
(2-1). (4 points) Using your clustering method(s) of choice, find a suitable clustering for the cells. Briefly explain how you chose the number of clusters by appropriate visualizations and/or numerical findings. (Cluster cells into the subcategories instead of the categories.)
Solution: To determine a suitable total number of clusters, I used the elbow method together with the silhouette score to select the number of K-means clusters. As shown below, K = 7 is the most appropriate choice.
I then plotted the t-SNE embedding after PCA projection as before, with different colors indicating the different sub-clusters. It can be seen that with 7 sub-clusters the groups are clearly distinguishable.
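A minimal sketch of the elbow/silhouette selection described above, assuming X_pca from the earlier sketch:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

ks = range(2, 13)
inertias, sils = [], []
for k in ks:
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X_pca)
    inertias.append(km.inertia_)                      # elbow criterion
    sils.append(silhouette_score(X_pca, km.labels_))  # silhouette criterion

# Plot both criteria side by side and look for the elbow / the peak
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(list(ks), inertias, "o-")
ax1.set_xlabel("k"); ax1.set_ylabel("inertia")
ax2.plot(list(ks), sils, "o-")
ax2.set_xlabel("k"); ax2.set_ylabel("silhouette score")
plt.show()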
(2-2). (6 points) We will now treat your cluster assignments as labels for supervised learning. Fit a logistic regression model to the original data (not principal components), with your clustering as the target labels. Since the data is high-dimensional, make sure to regularize your model using your choice of l1, l2, or elastic net, and separate the data into training and validation or use cross-validation to select your model. Report your choice of regularization parameter and validation performance.
Solution: I used the previous cluster assignments as labels for supervised learning and fit a logistic regression model to the original data (not the principal components), with the clusters as target labels. The steps, sketched in code below, were:
Feature selection: select the 100 best features using SelectKBest.
Data normalization: standardize the features using StandardScaler.
Logistic regression modeling: use LogisticRegressionCV for the multi-class model, with l2 regularization.
Cross-validation: perform cross-validation with cross_val_score and output the per-fold scores and their mean.
Plot cross-validation scores: plot the cross-validation score for each fold.
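A minimal sketch of these steps, assuming X_log is the log-transformed original data and labels holds the K = 7 cluster assignments from part (2-1); using f_classif as the SelectKBest scoring function is an assumption:

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score

# Keep the 100 most discriminative original features, then standardize
X_sel = SelectKBest(f_classif, k=100).fit_transform(X_log, labels)
X_std = StandardScaler().fit_transform(X_sel)

# l2-regularized multi-class logistic regression; the regularization
# strength C is chosen by internal cross-validation over a default grid
clf = LogisticRegressionCV(penalty="l2", cv=5, max_iter=1000)
clf.fit(X_std, labels)
print("Optimal regularization parameters (per class):", clf.C_)

# Report validation performance via 5-fold cross-validation
scores = cross_val_score(clf, X_std, labels, cv=5)
print("Cross-validation scores:", scores, "mean:", scores.mean())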
[Figure: cross-validation score for each fold]
Optimal regularization parameters: [5.99484250e-03, 1.29154967e+03, 3.59381366e-01, 3.59381366e-01, 2.78255940e+00, 3.59381366e-01, 2.78255940e+00]
Cross-validation scores: [0.9654, 0.9977, 0.9931, 0.9747, 0.8868]
Mean cross-validation score: 0.9635  
3. (9 points) We will now treat your cluster assignments as labels for supervised learning. Fit a logistic  
regression model to the original data (not principal components), with your clustering as the target labels.  
Since the data is high-dimensional, make sure to regularize your model using your choice of l1, l2, or elastic  
net, and separate the data into training and validation or use cross-validation to select your model. Report  
your choice of regularization parameter and validation performance.  
Solution: For this task I chose the p2_evaluation_reduced dataset, due to computation limitations, and performed the following steps (sketched in code below):
Data loading: load the training and test data from the specified paths.
Logarithmic transformation: apply the transformation log2(x + 1) to the data.
Feature selection, comparing two methods: (a) 100 features chosen uniformly at random; (b) the 100 features with the largest variance.
Standardization: standardize the selected features.
Model training and evaluation: train logistic regression models and evaluate the performance of the two feature selection methods.
Plot histograms: compare the variance distribution of the randomly selected features with that of the high-variance features.
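A minimal sketch of the comparison; the file names under p2_evaluation_reduced/ are assumptions and should be replaced with the actual paths:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Load and log-transform the data (paths are assumptions)
X_tr = np.log2(np.load("p2_evaluation_reduced/X_train.npy") + 1)
y_tr = np.load("p2_evaluation_reduced/y_train.npy")
X_te = np.log2(np.load("p2_evaluation_reduced/X_test.npy") + 1)
y_te = np.load("p2_evaluation_reduced/y_test.npy")

# Method (a): 100 features chosen uniformly at random
rng = np.random.default_rng(0)
rand_idx = rng.choice(X_tr.shape[1], 100, replace=False)
# Method (b): the 100 features with the largest training variance
var_idx = np.argsort(X_tr.var(axis=0))[-100:]

for name, idx in [("random features", rand_idx), ("high variance features", var_idx)]:
    scaler = StandardScaler().fit(X_tr[:, idx])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(scaler.transform(X_tr[:, idx]), y_tr)
    pred = clf.predict(scaler.transform(X_te[:, idx]))
    print(name,
          "accuracy:", accuracy_score(y_te, pred),
          "F1:", f1_score(y_te, pred, average="weighted"))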
As shown in the figure below, the distribution of the 100 randomly selected features differs from that of the top 100 features with the highest variance. The variances of the highest-variance features are concentrated, mainly around 15, while the variances of the randomly selected features are much more dispersed.
The test results for the two feature selection methods are shown below: selecting the top 100 features by variance achieves a clear advantage on every metric, demonstrating the effectiveness of this method.
Metric      random features   high variance features
Accuracy    0.5099            0.9034
AUC         0.9559            0.9983
Precision   0.6450            0.9091
Recall      0.5099            0.9034
F1 Score    0.5393            0.9018
Problem 3  
1. (3 points) When we created the T-SNE plot in Problem 1, we ran T-SNE on the top 50 PC's of the data. But we could have easily chosen a different number of PC's to represent the data. Run T-SNE using 10, 50, 100, 250, and 500 PC's, and plot the resulting visualization for each. What do you observe as you increase the number of PC's used?
Solution:  
Run T-SNE using 10 PC's:  
Run T-SNE using 50 PC's:  
Run T-SNE using 100 PC's:  
Run T-SNE using 250 PC's:  
Run T-SNE using 500 PC's:  
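The five plots above were produced by varying only the number of PC's; a minimal sketch, assuming X_log from Problem 2:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Sweep the number of principal components fed into t-SNE
for n_pcs in [10, 50, 100, 250, 500]:
    X_pca = PCA(n_components=n_pcs).fit_transform(X_log)
    emb = TSNE(n_components=2, perplexity=50, random_state=0).fit_transform(X_pca)
    plt.figure()
    plt.scatter(emb[:, 0], emb[:, 1], s=5)
    plt.title(f"t-SNE on top {n_pcs} PCs")
plt.show()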
During the experiment I adjusted only the number of PC's; all other parameters are the same as in Problem 1. Comparing the results above, as the number of PC's increases, the points within each cluster of the T-SNE embedding become sparser and the apparent number of clusters changes. Specifically, with 10 and 50 PC's the number of clusters is 5, while with 100 and 250 PC's it is in general 3 (two clusters in the 100 PC's plot appear internally divisible into two classes, and one cluster in the 250 PC's plot appears internally divisible into two classes), and the points within the clusters are sparser than before. When the number of PC's reaches 500, the number of clusters is 4 (one cluster contains very few points).
2. (13 points) Pick three hyper-parameters below (3 is the total number that a report needs to analyze: either (a) 2 from A and 1 from B, or (b) 1 from A and 2 from B) and analyze how changing the hyper-parameters affects the conclusions that can be drawn from the data. Please choose at least one hyper-parameter from each of the two categories (visualization and clustering/feature selection). At minimum, evaluate the hyper-parameters individually, but you may also evaluate how joint changes in the hyper-parameters affect the results. You may use any of the datasets we have given you in this project. For visualization hyper-parameters, you may find it productive to augment your analysis with experiments on synthetic data, though we request that you use real data in at least one demonstration.
Solution: The hyper-parameters I selected are, from Category A, the T-SNE perplexity and the T-SNE learning rate, and, from Category B, the effect of the number of PC's chosen on clustering. The data are the same as in Problem 1.
I varied the number of PC's independently, and for each value I varied the T-SNE perplexity and learning rate jointly. This yields 25 plots per PC setting, which I arranged in a 5×5 grid for easy comparison; a sketch of the sweep is given below. Perplexity values: 10, 20, 30, 40, 50. Learning rate values: 100, 500, 1500, 2000, 2500. Number of PC's: 10, 50, 100, 250, 500.
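A minimal sketch producing one such 5×5 grid (here for 50 PC's; the other grids repeat this with a different n_pcs), assuming X_log from Problem 2:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

perplexities = [10, 20, 30, 40, 50]
learning_rates = [100, 500, 1500, 2000, 2500]
n_pcs = 50  # repeat for 10, 100, 250, 500

X_pca = PCA(n_components=n_pcs).fit_transform(X_log)

# One subplot per (perplexity, learning rate) combination
fig, axes = plt.subplots(5, 5, figsize=(20, 20))
for i, perp in enumerate(perplexities):
    for j, lr in enumerate(learning_rates):
        emb = TSNE(n_components=2, perplexity=perp, learning_rate=lr,
                   random_state=0).fit_transform(X_pca)
        axes[i, j].scatter(emb[:, 0], emb[:, 1], s=2)
        axes[i, j].set_title(f"perp={perp}, lr={lr}", fontsize=8)
plt.show()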
Number of PC's: 10
Number of PC's: 50
Number of PC's: 100
Number of PC's: 250
Number of PC's: 500
Conclusion: Individually, as the number of PC's increases, the number of clusters generally decreases and the data points within each class become sparser. As the T-SNE perplexity increases, the number of clusters generally decreases and the points within each class become denser. As the T-SNE learning rate increases, the quality of the embedding generally decreases and many isolated points appear.
Jointly, when the T-SNE perplexity and learning rate increase at the same time, the number of clusters decreases, and a small number of isolated points remain even when both reach the largest values in the grid. Setting both too large severely degrades the T-SNE result. In summary, the learning rate is effective at roughly 100-1500, the perplexity at roughly 20-50, and the number of PC's at roughly 10-250.