HAN's BLOG

Predicting Stock Prices Using Linear Regression for Machine Learning (1)

Outline

1. Introduction

1. Introduction

The purpose of this post is to predict stock prices using ML with financial statements.

I have already collected the financial statements from the Yahoo Finance website. The method is explained at this link; check it if you want to know how to get the data. The full idea has also been explained on Kaggle.

The overall method will be carried out in the following …

  1. Data Preprocessing
  2. Correlation for features
  3. Modeling
  4. Conclusion

Data Preprocessing

It is important to analyze the data before applying ML. I only use the most recent values and the numeric data. For example, a suffix such as B (billion) is converted to 10^9, and so on (see Link). After that, I load the preprocessed data.
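The suffix conversion mentioned above can be sketched as follows. This is only a minimal illustration, not the preprocessing code from the linked post: the helper name `parse_abbreviated_number` and the suffix table are assumptions made for this example.

```python
# Hypothetical helper: convert abbreviated figures such as "2.5B" into floats.
# The suffix table is an assumption; adjust it to the suffixes in your data.
SUFFIX_MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}


def parse_abbreviated_number(text):
    """Return a float for strings like '2.5B', or None if the text is not numeric."""
    cleaned = text.strip().replace(",", "")
    if not cleaned:
        return None
    multiplier = 1.0
    suffix = cleaned[-1].upper()
    if suffix in SUFFIX_MULTIPLIERS:
        # Strip the suffix and remember the power of ten it stands for.
        multiplier = SUFFIX_MULTIPLIERS[suffix]
        cleaned = cleaned[:-1]
    try:
        return float(cleaned) * multiplier
    except ValueError:
        # Non-numeric entries such as "N/A" are reported as missing.
        return None


print(parse_abbreviated_number("2.5B"))
print(parse_abbreviated_number("N/A"))
```

Returning `None` for unparseable entries keeps missing values explicit, so they can be filtered out later instead of silently becoming zeros.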

Previous Data Preprocessing

Load the preprocessed financial statements of the stocks

import pandas as pd

# url and index_name are assumed to be defined earlier (base data location and index name)
df_stats = pd.read_json(url+'/data_preprocessing/{0}_stats_element.json'.format(index_name))
df_addstats = pd.read_json(url+'/data_preprocessing/{0}_addstats_element.json'.format(index_name))
df_balsheets = pd.read_json(url+'/data_preprocessing/{0}_balsheets_element.json'.format(index_name))
df_income = pd.read_json(url+'/data_preprocessing/{0}_income_element.json'.format(index_name))
df_flow = pd.read_json(url+'/data_preprocessing/{0}_flow_element.json'.format(index_name))

Merge dataframe

df = pd.concat([df_stats, df_addstats, df_balsheets, df_income, df_flow], axis=1)
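To see what `pd.concat(..., axis=1)` does here, the toy example below may help. It assumes the rows of each frame are indexed by ticker; the tickers, column names, and values are made up for illustration.

```python
import pandas as pd

# Toy frames indexed by (made-up) tickers.
stats = pd.DataFrame({"marketCap": [2.5e9, 1.0e9]}, index=["AAA", "BBB"])
income = pd.DataFrame({"netIncome": [3.0e8, 1.0e8]}, index=["BBB", "CCC"])

# axis=1 places the frames side by side and aligns rows on the index.
merged = pd.concat([stats, income], axis=1)
print(merged)
```

With the default outer join, a ticker that is missing from one frame still gets a row, and its missing entries are filled with NaN. This is one way that tickers with incomplete information end up in the merged data.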

Check numeric columns

from pandas.api.types import is_numeric_dtype
num_cols = [is_numeric_dtype(dtype) for dtype in df.dtypes]
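`num_cols` is a list of booleans, one per column. A minimal sketch of how such a mask could be used to keep only the numeric columns is shown below; the toy frame and its column names are assumptions for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame with one non-numeric column (made-up values).
toy = pd.DataFrame({
    "marketCap": [2.5e9, 1.0e9],
    "sector": ["Tech", "Energy"],
    "beta": [1.2, 0.8],
})

# Boolean mask: True for numeric columns, False otherwise.
num_cols = [is_numeric_dtype(dtype) for dtype in toy.dtypes]

# Apply the mask to select the numeric columns only.
numeric_df = toy.loc[:, num_cols]
print(list(numeric_df.columns))

# select_dtypes gives the same selection in a single call.
same_df = toy.select_dtypes(include="number")
```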

Split the data into train and test sets for correlation

from sklearn.model_selection import train_test_split
train_df_corr, test_df_corr = train_test_split(df, test_size=0.2)

Correlation for features and Heatmap

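import matplotlib.pyplot as plt
import seaborn as sns

corrmat = train_df_corr.corr()
# Keep every feature whose absolute correlation with marketCap is above 0
top_corr_features = corrmat.index[abs(corrmat['marketCap'])>0]
plt.figure(figsize=(13,10))
plt_corr = sns.heatmap(train_df_corr[top_corr_features].corr(), annot=True)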

We can choose which features to consider by changing the threshold on the degree of correlation.

Here the threshold is 0, a value that does not take the degree of correlation into account.
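As a sketch of how a non-zero threshold would change the selection, the example below uses a small made-up frame. The column names `revenue` and `noise` and the threshold of 0.5 are assumptions chosen for illustration, not values from the dataset.

```python
import pandas as pd

# Made-up data: revenue follows marketCap closely, noise does not.
toy = pd.DataFrame({
    "marketCap": [1.0, 2.0, 3.0, 4.0, 5.0],
    "revenue": [1.1, 2.0, 2.9, 4.2, 5.0],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
})

corrmat = toy.corr()

# Threshold 0 keeps every feature with a non-zero correlation.
all_features = corrmat.index[corrmat["marketCap"].abs() > 0]

# A stricter (hypothetical) threshold keeps only strongly correlated features.
threshold = 0.5
selected = corrmat.index[corrmat["marketCap"].abs() > threshold]

print(list(all_features))
print(list(selected))
```

`marketCap` always appears in the result because its correlation with itself is 1.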

Heat map

At first glance, many of the values are not meaningful, because some tickers contain insufficient information. Therefore, we need additional preprocessing to handle the insufficient information.
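One possible form of such preprocessing, shown only as a sketch, is to drop sparse columns first and then drop tickers that still have missing values. The 50% cutoff, the tickers, and the column names below are assumptions, not the method used in this series.

```python
import numpy as np
import pandas as pd

# Made-up frame indexed by ticker, with missing entries.
toy = pd.DataFrame(
    {
        "marketCap": [2.5e9, 1.0e9, np.nan, 4.0e9],
        "revenue": [5.0e8, np.nan, np.nan, 9.0e8],
        "rareField": [np.nan, np.nan, np.nan, 1.0],
    },
    index=["AAA", "BBB", "CCC", "DDD"],
)

# Keep columns that have non-missing values for at least half of the tickers.
min_count = int(0.5 * len(toy))
cols_kept = toy.dropna(axis=1, thresh=min_count)

# Then keep only the tickers with complete information in those columns.
cleaned = cols_kept.dropna(axis=0, how="any")

print(list(cleaned.columns))
print(list(cleaned.index))
```

Dropping columns before rows preserves more tickers, since a single rarely reported field would otherwise remove almost every row.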