Reddit Post Classification

Muhammad | Text Classification

Problem Statement:

In this project, we want to see whether we can tell apart posts from two popular subreddits, datascience and wallstreetbets (or, more generally, any two subreddits or categories a text might belong to). Can we build a simple tool that accurately says which group a post is more likely to come from? And which words or phrases give us the best clues?

import re
import unicodedata

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import plotly.express as px
import praw
import spacy
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize.toktok import ToktokTokenizer
from textblob import Word

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
reddit = praw.Reddit(
    client_id='client_id goes here',
    client_secret='client secret goes here',
    user_agent='Pro3',
    username='-__A__-',
    password=''
)
# Below is JUST an example of how you can use PRAW
 
# Choose your subreddit
subreddit_DataScience = reddit.subreddit('DataScience')
subreddit_wallstreetbets = reddit.subreddit('wallstreetbets')
 
# Adjust the limit as needed. Note: Reddit's listing API returns at most
# roughly 1,000 posts per listing, so a limit of 2525 will not be fully
# honored -- hence the smaller counts seen below.
posts_DS = subreddit_DataScience.new(limit=2525)
posts_wsb = subreddit_wallstreetbets.new(limit=2525)
data = []
for post in posts_DS:
    data.append([post.title, post.selftext, post.subreddit])
 
# Turn into a dataframe
datascience = pd.DataFrame(data, columns = ['title', 'self_text', 'subreddit'])
datascience.head()
                                               title                                          self_text    subreddit
0  What Every Developer Should Know About GPU Com...                                                     datascience
1  Do you use CRUD or like apps to bridge the gap...  In my about 5 years of experience working for ...  datascience
2  Anybody ever been drug tested for handling sen...  I am currently a DA for a company that uses da...  datascience
3              Any data imputation technique shares?  Hello, \n\nI’ve been reading up some articles ...  datascience
4  Application of classical time series and deep ...                                                     datascience
data_wsb = []
for post in posts_wsb:
    data_wsb.append([post.title, post.selftext, post.subreddit])
 
# Turn into a dataframe
wsb = pd.DataFrame(data_wsb, columns = ['title', 'self_text', 'subreddit'])
wsb.head()
                                              title                                          self_text       subreddit
0  How Can I Bet Against the US Defaulting on Debt?  With the US slipping into 33.5 trillion of deb...  wallstreetbets
1             Corporate Bankruptcies - next shoe...                                                     wallstreetbets
2                                    Seeking wisdom  Tesla announced their earnings and the stock w...  wallstreetbets
3  Hedge funds, Pension funds, Banks, CEOs and th...                                                    wallstreetbets
4           Bulls: "WE ARE SO OVERSOLD" .. Reality:                                                     wallstreetbets
df = pd.concat([datascience, wsb])
df.subreddit.value_counts()
datascience       860
wallstreetbets    717
Name: subreddit, dtype: int64
df.head()
                                               title                                          self_text    subreddit
0  What Every Developer Should Know About GPU Com...                                                     datascience
1  Do you use CRUD or like apps to bridge the gap...  In my about 5 years of experience working for ...  datascience
2  Anybody ever been drug tested for handling sen...  I am currently a DA for a company that uses da...  datascience
3              Any data imputation technique shares?  Hello, \n\nI’ve been reading up some articles ...  datascience
4  Application of classical time series and deep ...                                                     datascience
df.tail()
                                                 title                                          self_text       subreddit
712                          Giving MCD the big DD     The McRib is back (again). logically speaking ...  wallstreetbets
713                       Banner bank at risk banr     Does anyone know the rumors around which bank ...  wallstreetbets
714       Meta laying off most of Metaverse teams      “Meta (META.O) is planning to lay off employee...  wallstreetbets
715                     Mortgage rates just hit 8%     Student loan payments start again this month, ...  wallstreetbets
716  Microsoft Needs So Much Power to Train AI That...     Invest in small nuclear reactor manufacturers  wallstreetbets

Shuffling the DataFrame

# Shuffle the rows (passing a fixed random_state would make this reproducible)
df = df.sample(frac=1)
df[:10]
                                                 title                                          self_text       subreddit
71                       Sap ui5 fiori vs data science  I'm in a part of my life where I hate my job. ...     datascience
783        PG Certification in Business Data Analytics  Hey!\n\nDo you have any reviews on the above m...     datascience
205                   SHAP Deep Reinforcement Learning  Hi Guys,\n\nIs there a way to integrate SHAP w...     datascience
537           AAA service trucks are using Rivians now                                                     wallstreetbets
304  What do corporate data scientists struggle wit...  As a data scientist, if you could let someone ...     datascience
521  Idea for a Tool - "Define your data science pr...  Hey guys, I've worked with a lot of clients th...     datascience
226      Daily Discussion Thread for October 16, 2023  **Join **[**WSB's community voice chat**](http...  wallstreetbets
232  How a slick accounting maneuver led to a $29 b...                                                     wallstreetbets
7        Data Structures & Algorithms in Data Science  hi ppl. I'm wondering if it is useful to learn...     datascience
363          Quantifying picture component to a whole  Simple example would be chopping a square into...     datascience

Feature engineering and pre-processing

Merging title and self_text

df['post'] = df.apply(lambda row: f"title: {row['title']} text: {row['self_text']}", axis=1)
df.head()
                                                 title                                          self_text    subreddit                                               post
71                       Sap ui5 fiori vs data science  I'm in a part of my life where I hate my job. ...  datascience  title: Sap ui5 fiori vs data science text: I'm...
783        PG Certification in Business Data Analytics  Hey!\n\nDo you have any reviews on the above m...  datascience  title: PG Certification in Business Data Analy...
205                   SHAP Deep Reinforcement Learning  Hi Guys,\n\nIs there a way to integrate SHAP w...  datascience  title: SHAP Deep Reinforcement Learning text: ...
537           AAA service trucks are using Rivians now                                                  wallstreetbets  title: AAA service trucks are using Rivians no...
304  What do corporate data scientists struggle wit...  As a data scientist, if you could let someone ...  datascience  title: What do corporate data scientists strug...
df.drop(['title','self_text'], axis=1, inplace=True)
df.head()
          subreddit                                               post
71      datascience  title: Sap ui5 fiori vs data science text: I'm...
783     datascience  title: PG Certification in Business Data Analy...
205     datascience  title: SHAP Deep Reinforcement Learning text: ...
537  wallstreetbets  title: AAA service trucks are using Rivians no...
304     datascience  title: What do corporate data scientists strug...

Preprocessing

I use regex to remove numbers and links from the post and create a new column, cleaned_post, containing the processed text.

pattern = r'\b\d+\b|http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
df['cleaned_post'] = df['post'].replace(pattern, '', regex=True)
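As a quick sanity check, here is the pattern applied to a made-up string (illustrative only): the standalone number and the URL disappear, while ordinary words are left intact.

sample = "Lost 5000 on SPY today, details at https://example.com/post"
print(re.sub(pattern, '', sample))
# -> 'Lost  on SPY today, details at '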

We will use CountVectorizer, starting with the default settings; later we will also vary its ngram_range:

# initialize CountVectorizer
vectorizer = CountVectorizer()
# fit and transform into a document-term matrix
wm = vectorizer.fit_transform(df['cleaned_post'])
wm.toarray()
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
df_vect_ex = pd.DataFrame(wm.toarray(), columns= vectorizer.get_feature_names_out(), index=df.index)
df_vect_ex
      06pm  0dte  0dtes  0s  0t  0th  1000s  ...  zillow  zone  zones  zoom  zoomer
71       0     0      0   0   0    0      0  ...       0     0      0     0       0
783      0     0      0   0   0    0      0  ...       0     0      0     0       0
205      0     0      0   0   0    0      0  ...       0     0      0     0       0
537      0     0      0   0   0    0      0  ...       0     0      0     0       0
304      0     0      0   0   0    0      0  ...       0     0      0     0       0
...    ...   ...    ...  ..  ..  ...    ...  ...     ...   ...    ...   ...     ...
336      0     0      0   0   0    0      0  ...       0     0      0     0       0
40       0     0      0   0   0    0      0  ...       0     0      0     0       0
838      0     0      0   0   0    0      0  ...       0     0      0     0       0
576      0     0      0   0   0    0      0  ...       0     0      0     0       0
61       0     0      0   0   0    0      0  ...       0     0      0     0       0

1577 rows × 11947 columns

df_vect_ex['target_subreddit'] = df['subreddit']
Count_w = df_vect_ex.drop('target_subreddit', axis=1).sum().sort_values(ascending = False)
import seaborn as sns
sns.barplot(x=Count_w.index[:10], y = Count_w[:10], color='purple')
plt.show()

def gettopten(df):
    # Count alphabetic tokens only; the token_pattern also skips URL
    # fragments (http/https/www/ftp) and tokens glued to domain suffixes.
    nv = CountVectorizer(stop_words='english', token_pattern=(r'\b(?!http\b|https\b|www\b|ftp\b)(?<!http)(?<!https)(?<!www)(?<!ftp)'
           r'\b[^\d\W]+\b(?!.[a-zA-Z0-9]+\b)'))
    nvv = nv.fit_transform(df['cleaned_post'])
    df_no = pd.DataFrame(nvv.toarray(), columns=nv.get_feature_names_out(), index=df.index)
    new_count = df_no.sum().sort_values(ascending=False)
    return sns.barplot(x=new_count.index[:10], y=new_count[:10], palette='colorblind')
gettopten(df)
plt.title('Top 10 Values')
plt.tight_layout()
plt.show()

Now we will check the top 10 word counts in each subreddit.

df[df['subreddit'] == 'datascience']
          subreddit                                               post                                       cleaned_post
71      datascience  title: Sap ui5 fiori vs data science text: I'm...  title: Sap ui5 fiori vs data science text: I'm...
783     datascience  title: PG Certification in Business Data Analy...  title: PG Certification in Business Data Analy...
205     datascience  title: SHAP Deep Reinforcement Learning text: ...  title: SHAP Deep Reinforcement Learning text: ...
304     datascience  title: What do corporate data scientists strug...  title: What do corporate data scientists strug...
521     datascience  title: Idea for a Tool - "Define your data sci...  title: Idea for a Tool - "Define your data sci...
...             ...                                                ...                                                ...
692     datascience  title: Possibility of getting Data Science (Jr...  title: Possibility of getting Data Science (Jr...
505     datascience  title: AI Career text: I'm currently in my fir...  title: AI Career text: I'm currently in my fir...
780     datascience  title: Do people not use sci-kit learn / other...  title: Do people not use sci-kit learn / other...
838     datascience  title: Computer for Coding text: Hi everyone, ...  title: Computer for Coding text: Hi everyone, ...
61      datascience  title: Repetitive airflow pipeline problems te...  title: Repetitive airflow pipeline problems te...

860 rows × 3 columns

gettopten(df[df['subreddit'] == 'datascience'])
plt.title('Top 10 Values')
plt.tight_layout()
plt.show()

gettopten(df[df['subreddit'] == 'wallstreetbets'])
plt.title('Top 10 Values')
plt.tight_layout()
plt.show()

def gettop10(df, n, stop='english'):
    # Count n-grams of size n and plot the ten most frequent
    cvec = CountVectorizer(ngram_range=(n, n), stop_words=stop)
    nvv = cvec.fit_transform(df['cleaned_post'])
    df_no = pd.DataFrame(nvv.toarray(), columns=cvec.get_feature_names_out(), index=df.index)
    new_count = df_no.sum().sort_values(ascending=False)
    plt.figure(figsize=(12, 6))
    plt.tight_layout()
    return sns.barplot(x=new_count[:10], y=new_count.index[:10], palette='colorblind')

Top 10 most frequently occurring bigrams in the entire dataset

gettop10(df,2)
plt.title('Top 10 Values')
plt.tight_layout()
plt.show()

Discovery:

There is a token, x200b, that appears frequently. Upon further investigation, it comes from &#x200B;, the HTML entity for the Unicode zero-width space character (U+200B), which shows up often in Reddit post bodies. We will need to remove this character from the posts with the help of regex.

df['cleaned_post'] = df['cleaned_post'].replace(r'x200B|text|title|\n|\'', '', regex=True)
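A word of caution: because this pattern has no word boundaries, it also deletes 'text' and 'title' inside longer words (for example, 'contexts' becomes 'cons', which is visible in an outlier post later). A safer variant, sketched here rather than what was actually run above, would be:

# Hypothetical safer cleanup: word boundaries keep words like 'contexts' intact
safe_pattern = r'x200B|\btext\b|\btitle\b|\n|\''
# df['cleaned_post'] = df['cleaned_post'].replace(safe_pattern, '', regex=True)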
gettop10(df,2)
plt.title('Top 10 Values')
plt.tight_layout()
plt.show()

Top 10 occurring bigrams in the wallstreetbets subreddit

gettop10(df[df['subreddit'] == 'wallstreetbets'],2)
plt.title('Top 10 Values')
plt.tight_layout()
plt.show()

Top 10 occurring bigrams in the datascience subreddit

gettop10(df[df['subreddit'] == 'datascience'],2)
plt.title('Top 10 Values')
plt.tight_layout()
plt.show()

Top 10 occurring trigrams in the wallstreetbets subreddit

gettop10(df[df['subreddit'] == 'wallstreetbets'],3)
plt.title('Top 10 Values')
plt.tight_layout()
plt.show()

Top 10 occurring trigrams in the datascience subreddit

gettop10(df[df['subreddit'] == 'datascience'],3)
plt.title('Top 10 Values')
plt.tight_layout()
plt.show()

Trigrams with stopwords included

gettop10(df[df['subreddit'] == 'datascience'],3,stop=None)
plt.title('Top 10 Values')
plt.tight_layout()
plt.show()

gettop10(df[df['subreddit'] == 'wallstreetbets'],3,stop=None)
plt.title('Top 10 Values')
plt.tight_layout()
plt.show()

4-grams with stopwords included

gettop10(df[df['subreddit'] == 'datascience'],4,stop=None)
plt.title('Top 10 Values')
plt.tight_layout()
plt.show()

gettop10(df[df['subreddit'] == 'wallstreetbets'],4,stop=None)
plt.title('Top 10 Values')
plt.tight_layout()
plt.show()

t-SNE Visualization

df.reset_index(drop=True, inplace=True)
color_palette = [ '#56B4E9', '#009E73', '#F0E442', '#0072B2', '#D55E00', '#CC79A7']
#'#E69F00', '#56B4E9','#0072B2'
 
def tsne_viz(df, n, stop='english'):
   
    cvec = TfidfVectorizer(ngram_range=(n,n), stop_words=stop)
    vectorized_matrix = cvec.fit_transform(df['cleaned_post'])
    
    tsne = TSNE(n_components=3, random_state=42)
    tsne_results = tsne.fit_transform(vectorized_matrix.toarray())
    
    fig = plt.figure(figsize=(12,8))
    ax = fig.add_subplot(111, projection='3d')
    
    scatter = ax.scatter(tsne_results[:,0], tsne_results[:,1], tsne_results[:,2], 
                         c=pd.factorize(df['subreddit'])[0], cmap="viridis", s=60)
    
    legend1 = ax.legend(*scatter.legend_elements(), title="Subreddits")
    ax.add_artist(legend1)
    
    ax.set_title('3D t-SNE Visualization')
    ax.set_xlabel('t-SNE Dimension 1')
    ax.set_ylabel('t-SNE Dimension 2')
    ax.set_zlabel('t-SNE Dimension 3')
    
    plt.show()
 
tsne_viz(df, 3, stop='english')
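One practical note: t-SNE here runs on the densified TF-IDF matrix (nearly 12,000 columns), which is slow and memory-hungry. A common alternative, sketched below as an assumption rather than what was run here, is to reduce the sparse matrix with TruncatedSVD first and hand the dense 50-dimensional output to t-SNE:

from sklearn.decomposition import TruncatedSVD

def tsne_viz_svd(df, n, stop='english'):
    # TruncatedSVD accepts sparse input, so no .toarray() is needed
    tfidf = TfidfVectorizer(ngram_range=(n, n), stop_words=stop)
    reduced = TruncatedSVD(n_components=50, random_state=42).fit_transform(
        tfidf.fit_transform(df['cleaned_post']))
    return TSNE(n_components=3, random_state=42).fit_transform(reduced)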

 
def tsne_vizinter(df, n, stop='english'):
 
    cvec = TfidfVectorizer(ngram_range=(n-1,n), stop_words=stop)
    vectorized_matrix = cvec.fit_transform(df['cleaned_post'])
       
    tsne = TSNE(n_components=3, random_state=42)
    tsne_results = tsne.fit_transform(vectorized_matrix.toarray())
   
    df_tsne = pd.DataFrame(tsne_results, columns=['dim1', 'dim2', 'dim3']).reset_index(drop=True)
    df_tsne['subreddit'] = df['subreddit'].reset_index(drop=True)
    
    fig = px.scatter_3d(df_tsne, x='dim1', y='dim2', z='dim3', color='subreddit',color_discrete_sequence=color_palette)
    fig.show()
 
 
tsne_vizinter(df, 3, stop='english')
[interactive Plotly 3-D t-SNE plot: bigram + trigram features]

We can see that bigram and trigram features give mixed clusters.

 
def tsne_viz_index(df, n, stop='english'):
 
    cvec = TfidfVectorizer(ngram_range=(n-1,n), stop_words=stop)
    vectorized_matrix = cvec.fit_transform(df['cleaned_post'])
    
    tsne = TSNE(n_components=3, random_state=42)
    tsne_results = tsne.fit_transform(vectorized_matrix.toarray())
    
    df_tsne = pd.DataFrame(tsne_results, columns=['dim1', 'dim2', 'dim3'])
    df_tsne['subreddit'] = df['subreddit'].reset_index(drop=True)
    df_tsne['index'] = df.index  # Add the index as a column
 
    fig = px.scatter_3d(df_tsne, x='dim1', y='dim2', z='dim3', color='subreddit', hover_data=['index'],color_discrete_sequence=color_palette)
    fig.show()
 
tsne_viz_index(df, 2, stop='english')
[interactive Plotly 3-D t-SNE plot: unigram + bigram features]

We see that unigram and bigram features give well-separated clusters.

print(df.iloc[501]['cleaned_post'])  # Checking an outlier post from the visualization above
print(df.iloc[501]['subreddit'])
: My fault guys I bought  DTE Calls a minute before the drop : 
wallstreetbets

An outlier that has only a title and a stray ':' where the body text was.

print(df.iloc[1148]['cleaned_post'])  # Checking an outlier post from the visualization above
print(df.iloc[1148]['subreddit'])
: Isn’t Disney supposed to be under ? :       I have been watching this stock for really long time. It does look like company isn’t going to get better anytime soon. Even Ceo said in the interview that Disney is in worse shape than he thought.     Why are people still buying this stock? Is it solely because people are betting it’s going to turn around like meta?     I’m bullish on Disney but I’m just going to wait until it goes below .
wallstreetbets

The t-SNE visualization correctly identifies this post as an outlier: it is a long, question-style discussion of Disney's stock, quite unlike the typical wallstreetbets post.

print(df.iloc[59]['cleaned_post'])  # Checking an outlier post from the visualization above
print(df.iloc[59]['subreddit'])
: Sick to my stomach - Lost 23K : I started with the about 8K investing at the beginning of this year. Had made it to little over 40K by end of September.  Today I disregarded all my stop loss rules, and personal limits and paid for it dearly. My mind was so set on chart patterns, Vs, and inverse Vs from all these days of trading, I was too confident that at some point, there would be a drop, and I kept buying puts on top of puts.  Let this be a lesson bools and bears.  I am really sad, angry and upset today.  Will see about Monday when Monday comes along.   
wallstreetbets

Another example of an outlier.

tsne_viz_index(df, 3, stop='english')
[interactive Plotly 3-D t-SNE plot: bigram + trigram features]

Bigram and trigram features give a mixed cluster.

print(df.iloc[376]['cleaned_post'])  # Checking an outlier post from the visualization above
print(df.iloc[376]['subreddit'])
: AI’s Data Cannibalism : Im looking to read more on this topic mentioned in the .&#;Feel free to suggest books and articles
datascience

This post is quite ambiguous.

Further cleaning

import re  # emoji ranges below sourced from ChatGPT
 
def clean_text(text):
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    
    # Remove Emojis
    emoji_pattern = re.compile(
        u"([\U00002600-\U000027BF])|"  # Misc symbols
        u"([\U0001F600-\U0001F64F])|"  # Emoticons
        u"([\U0001F300-\U0001F5FF])|"  # Symbols & pictographs
        u"([\U0001F680-\U0001F6FF])|"  # Transport & map symbols
        u"([\U0001F700-\U0001F77F])|"  # Alchemical symbols
        u"([\U0001F780-\U0001F7FF])|"  # Geometric shapes ext
        u"([\U0001F800-\U0001F8FF])|"  # Supplemental arrows C
        u"([\U0001F900-\U0001F9FF])|"  # Supplemental symbols
        u"([\U0001FA00-\U0001FA6F])|"  # Chess symbols
        u"([\U0001FA70-\U0001FAFF])"   # Symbols and pictographs ext A
        , re.UNICODE)
    text = emoji_pattern.sub(r'', text)
    
    # Remove excessive whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    # Remove placeholder text
    text = re.sub(r'Daily Discussion Thread for [A-Za-z\s]+,', '', text)
    
    return text
df2 = df.copy()
df2['cleaned_post'] = df2['cleaned_post'].apply(clean_text)
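A quick sanity check of clean_text on a made-up string (illustrative only): the URL, the emoji, and the extra whitespace all disappear.

print(clean_text("Check   https://example.com 🚀  out"))
# -> 'Check out'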
tsne_viz_index(df2, 3, stop='english')
[interactive Plotly 3-D t-SNE plot: bigram + trigram features, after further cleaning]

Even after further cleaning, bigram and trigram features still give mixed clusters, which implies they are not a good choice for modeling.
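To put a number on this impression, a quick cross-validation sweep over n-gram ranges (a sketch that was not run in the original notebook; it reuses df2 and the same display_name conversion used in the modeling section below) would look like:

from sklearn.model_selection import cross_val_score

labels = df2['subreddit'].apply(lambda s: s.display_name)
for rng in [(1, 1), (1, 2), (2, 2), (2, 3)]:
    pipe_ng = Pipeline([
        ('tvec', TfidfVectorizer(ngram_range=rng, stop_words='english')),
        ('logit', LogisticRegression(max_iter=2000)),
    ])
    scores = cross_val_score(pipe_ng, df2['cleaned_post'], labels, cv=5)
    print(rng, round(scores.mean(), 3))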

print(df2.iloc[1162]['cleaned_post'])  # Checking an outlier post from the visualization above
print(df2.iloc[1162]['subreddit'])
: Eye Tracking Data : Hey all,I am a neuroscience Ph.D. student working with some eye-tracking data. The typical approach in my lab has been to try and fit the data to a GLM. Which is fine as a first pass, but I dont want to be limited to just that. I am curious if anyone else here has worked with eye-tracking data and can point me in the right direction. As far as the details are concerned, I am collecting eye-tracking data in few experimental cons. I would go into detail, but I want to stay at least a bit vague for privacy concerns. But to give you some idea of what I am doing, I have one task where participants are looking for a certain stimulus among distractor stimuli. The primary measurable output of this experiment is what stimulus they move their eyes to. But I am sure there is more information captured in the eye-tracking data that we can leverage. Another experiment is looking at overall gaze stability to infer cognitive mechanisms. If anyone is interested, I am willing to go in to more detail via PM. Any help would be appreciated! My first instinct to use some form of logistic regression or SVM and check performance. Let me know if I am on the right track.
datascience

This long, domain-specific post about eye-tracking data reads very differently from typical datascience posts, which is likely why t-SNE places it far from the main cluster.

print(df2.iloc[193]['cleaned_post'])  # Checking an outlier post from the visualization above
print(df2.iloc[193]['subreddit'])
: Hoping on the AMC bull train :
wallstreetbets

Another outlier: a title-only post with no body text.

tsne_viz_index(df2, 4, stop='english')
[interactive Plotly 3-D t-SNE plot: trigram + 4-gram features]

Trigram and 4-gram features also give a mixed cluster.

The t-SNE visualization shows that unigram and bigram features give us the best-separated clusters.

[unigramandbigram.png: t-SNE clusters from unigram and bigram features]

Modeling

X = df2['cleaned_post']
y = df2['subreddit'].reset_index(drop=True)
# The subreddit column holds PRAW Subreddit objects; extract the name string
y = y.apply(lambda x: x.display_name)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
 
pgrid = {
    'tvec__stop_words': [None, 'english'],
    'tvec__min_df': [1, 2, 3],
    'tvec__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'logit__penalty': ['l1', 'l2'],
    'logit__C': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50],
    'logit__max_iter': [2000],
    'logit__solver': ['liblinear'],
}
  
 
pipe = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('logit', LogisticRegression())
   ])
gs_tvec = GridSearchCV(pipe, pgrid, cv=10, n_jobs=6)
%%time
gs_tvec.fit(X_train, y_train)
Wall time: 6min 40s

GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('tvec', TfidfVectorizer()),
                                       ('logit', LogisticRegression())]),
             n_jobs=6,
             param_grid={'logit__C': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05,
                                      0.1, 0.5, 1, 5, 10, 50],
                         'logit__max_iter': [2000],
                         'logit__penalty': ['l1', 'l2'],
                         'logit__solver': ['liblinear'],
                         'tvec__min_df': [1, 2, 3],
                         'tvec__ngram_range': [(1, 1), (1, 2), (2, 2)],
                         'tvec__stop_words': [None, 'english']})
gs_tvec.score(X_test, y_test)
0.959493670886076
gs_tvec.best_params_
{'logit__C': 5,
 'logit__max_iter': 2000,
 'logit__penalty': 'l2',
 'logit__solver': 'liblinear',
 'tvec__min_df': 1,
 'tvec__ngram_range': (1, 1),
 'tvec__stop_words': 'english'}
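This also lets us revisit the second question from the problem statement: which words give the best clues? A minimal sketch, assuming gs_tvec is the fitted grid search from above:

best_pipe = gs_tvec.best_estimator_
words = best_pipe.named_steps['tvec'].get_feature_names_out()
logit = best_pipe.named_steps['logit']

# In a binary problem, positive coefficients push predictions toward
# classes_[1] and negative ones toward classes_[0]
coefs = pd.Series(logit.coef_[0], index=words)
print('Strongest cues for', logit.classes_[1])
print(coefs.nlargest(10))
print('Strongest cues for', logit.classes_[0])
print(coefs.nsmallest(10))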

Let's also train an SVM model

 
pgrid_svm = {
    'tvec__stop_words': [None, 'english'],
    'tvec__min_df': [1, 2, 3],
    'tvec__ngram_range': [(1, 1), (1, 2), (2, 2)],
    # 'l2' is the standard penalty for LinearSVC; 'l1' yields sparse coef_ vectors
    'svm__penalty': ['l2'],
    'svm__C': [0.00001, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50],
    'svm__max_iter': [2000],
}
pipe_svm = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('svm', LinearSVC())
   ])
gs_tvec_SVM = GridSearchCV(pipe_svm, pgrid_svm, cv=10, n_jobs=6)
%%time
gs_tvec_SVM.fit(X_train, y_train)
Wall time: 59.1 s

GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('tvec', TfidfVectorizer()),
                                       ('svm', LinearSVC())]),
             n_jobs=6,
             param_grid={'svm__C': [1e-05, 0.0001, 0.0005, 0.001, 0.005, 0.01,
                                    0.05, 0.1, 0.5, 1, 5, 10, 50],
                         'svm__max_iter': [2000], 'svm__penalty': ['l2'],
                         'tvec__min_df': [1, 2, 3],
                         'tvec__ngram_range': [(1, 1), (1, 2), (2, 2)],
                         'tvec__stop_words': [None, 'english']})
gs_tvec_SVM.best_params_
{'svm__C': 0.5,
 'svm__max_iter': 2000,
 'svm__penalty': 'l2',
 'tvec__min_df': 1,
 'tvec__ngram_range': (1, 2),
 'tvec__stop_words': 'english'}
gs_tvec_SVM.score(X_test, y_test)
0.9670886075949368
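The imports at the top already include confusion_matrix and ConfusionMatrixDisplay, which went unused above; a minimal sketch of how to see where the SVM model errs on the test set:

preds_svm = gs_tvec_SVM.predict(X_test)
cm = confusion_matrix(y_test, preds_svm, labels=gs_tvec_SVM.classes_)
ConfusionMatrixDisplay(confusion_matrix=cm,
                       display_labels=gs_tvec_SVM.classes_).plot(cmap='Blues')
plt.show()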

We have successfully trained a model that achieves over 96% accuracy on the test set!

preds = gs_tvec.predict(["I made $1000 on the stock market today! let's go baby!", 
                         "How do I remove null objects from my dataset?",
                         "Guys I need some investment decisions, please help.",
                         "I trained a Logistic Regression model to classify the subreddits of a given post."])
preds
array(['wallstreetbets', 'datascience', 'wallstreetbets', 'datascience'],
      dtype=object)
pd.DataFrame({"input": ["I made $1000 on the stock market today! let's go baby!", 
                         "How do I remove null objects from my dataset?",
                         "Guys I need some investment decisions, please help.",
                         "I trained a Logistic Regression model to classify the subreddits of a given post."],
            "model prediction": preds})
                                               input model prediction
0  I made $1000 on the stock market today! let's ...   wallstreetbets
1      How do I remove null objects from my dataset?      datascience
2  Guys I need some investment decisions, please ...   wallstreetbets
3  I trained a Logistic Regression model to class...      datascience

Conclusion:

Our best model achieves over 96% accuracy on the held-out test set.

The classifier we built correctly labels posts based on the patterns it learned during training, as the tests on made-up posts below show:

input                                                                               model prediction
I made $1000 on the stock market today! let’s go baby!                              wallstreetbets
How do I remove null objects from my dataset?                                       datascience
Guys I need some investment decisions, please help.                                 wallstreetbets
I trained a Logistic Regression model to classify the subreddits of a given post.  datascience
© Muhammad Hassan.