Identifying machine learning data to make predictions ~ Discussion of Coding


As a learning exercise, I plan to implement a machine learning algorithm (probably a neural network) to predict what users earn from trading stocks, based on shares bought, shares sold, and transaction times. The datasets below are test data I made up.

Acronyms:

tab = time Apple shares were bought (milliseconds)
asb = Apple shares bought
tas = time Apple shares were sold (milliseconds)
ass = Apple shares sold
tgb = time Google shares were bought (milliseconds)
gsb = Google shares bought
tgs = time Google shares were sold (milliseconds)
gss = Google shares sold

Training data:

username,tab,asb,tas,ass,tgb,gsb,tgs,gss
a,234234,212,456789,412,234894,42,459289,0
b,234634,24,426789,2,234274,3,458189,22
c,239234,12,156489,67,271274,782,459120,3
d,234334,32,346789,90,234254,2,454919,2

Classifications:

a earned $45
b earned $60
c earned ?
d earned ?

Aim: predict the earnings of users c and d based on the training data.

Are there any data points I should add to this data set? Should I perhaps use alternative data? As this is just a learning exercise of my own creation, I can add any feature that may be useful.

This data will need to be normalised. Are there any other concepts I should be aware of? Perhaps I should not use time as a feature, since share prices can bounce up and down over time.
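As a rough illustration of the normalisation step mentioned above, here is a minimal sketch that parses the sample rows, sets the username aside, and min-max scales each column to [0, 1]. The function names `load_features` and `min_max_normalise` are made up for this example, and min-max scaling is only one common choice, not the only option.

```python
import csv
import io

# Sample rows copied from the question. The username column is parsed
# separately so it never ends up among the numeric features.
RAW = """username,tab,asb,tas,ass,tgb,gsb,tgs,gss
a,234234,212,456789,412,234894,42,459289,0
b,234634,24,426789,2,234274,3,458189,22
c,239234,12,156489,67,271274,782,459120,3
d,234334,32,346789,90,234254,2,454919,2
"""


def load_features(text):
    """Return (usernames, feature names, rows of floats) from CSV text."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    users, rows = [], []
    for record in reader:
        users.append(record[0])
        rows.append([float(value) for value in record[1:]])
    return users, header[1:], rows


def min_max_normalise(rows):
    """Rescale every column to [0, 1]; a constant column maps to 0.0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


users, names, rows = load_features(RAW)
normalised = min_max_normalise(rows)
```

With real data, the minimum and maximum would normally be computed on the labelled training rows only and then reused for the rows to be predicted.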

Answer by 2PacIsAlive for Identifying machine learning data to make predictions

Don't use the username along with the training data, because the network might make associations between the username and the $ earned. Including it would make the user's identity a factor in the output, while excluding it ensures the network can predict the $ earned for an arbitrary user.

Answer by Mikhailov Valentine for Identifying machine learning data to make predictions

With the parameters you are suggesting, it seems impossible to me to predict earnings.

The main reason is that the input parameters don't correlate with the output value.

Your input values can contradict each other. Consider this case: is it possible that for the same input you would expect different output values? If so, you won't be able to predict any output for such input. Going further, a trader's earnings depend not only on the number of shares bought and sold, but also on the price of each of them. This brings us to the situation where we give the neural network two equal inputs but want different outputs.
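To make the contradiction concrete, here is a small sketch with invented numbers. When two samples share the same input but have different targets, the best any model can do under squared error is predict the mean target for that input, which leaves an error that no amount of training removes. The sample values and function names are hypothetical.

```python
from collections import defaultdict


def best_constant_predictions(samples):
    """Group samples by input; the mean target minimises squared error."""
    groups = defaultdict(list)
    for features, target in samples:
        groups[tuple(features)].append(target)
    return {x: sum(ts) / len(ts) for x, ts in groups.items()}


def irreducible_error(samples):
    """Mean squared error left over even with the best possible predictions."""
    preds = best_constant_predictions(samples)
    return sum((t - preds[tuple(x)]) ** 2 for x, t in samples) / len(samples)


# Hypothetical traders: the first two bought and sold identical share counts
# but earned different amounts, because prices are not part of the input.
samples = [
    ([100, 100], 45.0),
    ([100, 100], -20.0),
    ([50, 10], 60.0),
]

preds = best_constant_predictions(samples)
error = irreducible_error(samples)
```

Here the prediction for the repeated input is 12.5, which matches neither trader, and the leftover error stays above zero. Adding the missing information (for example prices) as input features is what would separate the two cases.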

How do you define 'good' parameters for predicting the desired output in such a case? I suggest first looking for people who make such estimates, then trying to list the parameters they take into account. If you succeed, you will end up with a huge list of variables. Then you can try to build a model, for example using a neural network.

Answer by greeness for Identifying machine learning data to make predictions

You might want to solve your problem in the following order:

1) Prediction of an individual stock's future value based on all stocks' historical data.
2) Prediction of the total future value of a combination of stocks, based on a portfolio and all stocks' historical data.
3) A short-term buy-sell strategy for managing a portfolio (when and how much to buy or sell of which stocks).

If you can do 1) well for a particular stock, it's probably a good starting point for 2). 3) might be your goal, but I put it last because it's even more complicated.

I will make some assumptions below and focus on how to solve 1), hopefully. :)

I assume at each timestamp, you have a vector of all possible features, e.g.:

stock price of company A (this is the target value)
stock price of other companies B, C, ..., Z (other companies might affect company A directly or indirectly)
52 week lowest price of A, B, C, ..., Z (long-term features begin)
52 week highest price of A, B, C, ..., Z
monthly highest/lowest price of A, B, C, ..., Z
weekly highest/lowest price of A, B, C, ..., Z (short-term features begin)
daily highest/lowest price of A, B, C, ..., Z
is revenue report day of A, B, C, ..., Z (really important features begin)
change of revenue of A, B, C, ..., Z
change of profit of A, B, C, ..., Z
semantic score of company profile from social networks of A, ..., Z
... (imagination helps here)

And I assume you have almost all of the above features at every fixed time interval.

I think an LSTM-like neural network is very relevant here.
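A recurrent model such as an LSTM is usually trained on short sequences of consecutive feature vectors. As a sketch of how the per-timestamp vectors described above could be cut into such sequences, the hypothetical helper `make_windows` below pairs each window with the next value of the target feature. The prices are invented, and the network itself is deliberately left out.

```python
def make_windows(series, window, target_index=0):
    """Split a list of per-timestamp feature vectors into training pairs.

    Each input is `window` consecutive vectors; each target is the value of
    feature `target_index` at the timestamp right after that window.
    """
    inputs, targets = [], []
    for start in range(len(series) - window):
        inputs.append(series[start:start + window])
        targets.append(series[start + window][target_index])
    return inputs, targets


# Hypothetical feature vectors: [price of company A, price of company B]
series = [
    [10.0, 20.0],
    [11.0, 19.5],
    [12.0, 19.0],
    [11.5, 19.2],
    [12.5, 19.1],
]

inputs, targets = make_windows(series, window=3)
```

With five timestamps and a window of three, this gives two training pairs; each input has the shape (window, number of features) that sequence models typically expect.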

Answer by Yura Zaletskyy for Identifying machine learning data to make predictions

Besides normalisation you'll also need scaling. Another question I have for you concerns the classification of stocks. In your example you use Google and Apple, which are considered blue-chip stocks. To clarify: do you want to predict earnings only for Google and Apple, or for any combination of two stocks?

If you want to make predictions only for Google and Apple using the data you have, then you can apply just normalisation and scaling together with some kind of recurrent neural network. Recurrent NNs tend to be better at prediction tasks than a simple feedforward model trained with backpropagation.

But if you want to apply your training algorithm to more than just Google and Apple, I recommend splitting your training data into groups by some criterion. One example is grouping stocks by market capitalisation, say into five groups. With five groups, you can also apply equilateral encoding to reduce the number of input dimensions for the NN.

Another kind of grouping you could consider is the stock's area of operation, for example agricultural, technological, medical, hi-end, and tourist. Say you decide to use this grouping. Then the five groups give you five entries in the NN's input layer (so-called one-of-n, or one-hot, encoding).

And let's say you want to feed in an agricultural stock.

Then the input will look like this: 1, 0, 0, 0, 0, x1, x2, ..., xn

where x1, x2, ..., xn are the other entries. If you apply equilateral encoding instead, you'll have one dimension fewer (I'm too lazy to describe what it would look like).
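The input layout above, with a 1 in the position of the stock's group and 0 elsewhere, is commonly called one-of-n or one-hot encoding. A minimal sketch follows, assuming the five example groups in the order listed; `one_hot` and `encode_input` are names invented for this example.

```python
# Group order is an assumption taken from the example list in the answer.
SECTORS = ["agricultural", "technological", "medical", "hi-end", "tourist"]


def one_hot(sector, sectors=SECTORS):
    """Return a list with 1 at the sector's position and 0 elsewhere."""
    if sector not in sectors:
        raise ValueError(f"unknown sector: {sector!r}")
    return [1 if name == sector else 0 for name in sectors]


def encode_input(sector, other_features):
    """Prepend the one-hot sector code to the remaining input entries."""
    return one_hot(sector) + list(other_features)


# x1, x2 stand in for the other (already normalised) entries.
example = encode_input("agricultural", [0.3, 0.7])
```

For an agricultural stock this gives 1, 0, 0, 0, 0 followed by the other entries, matching the layout shown above.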

One more idea for converting entries for a neural network is thermometer encoding.
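Thermometer encoding suits ordered categories, such as capitalisation groups from smallest to largest: each level switches on its own position and every position below it. The sketch below assumes a 0-based level and uses made-up group names; conventions differ on whether the lowest level is all zeros or a single 1, and this example uses a single 1.

```python
# Hypothetical ordered groups, smallest capitalisation first.
CAP_GROUPS = ["micro", "small", "mid", "large", "mega"]


def thermometer(level, n_levels):
    """Encode a 0-based ordinal level as level + 1 leading ones."""
    if not 0 <= level < n_levels:
        raise ValueError(f"level {level} out of range for {n_levels} levels")
    return [1 if i <= level else 0 for i in range(n_levels)]


def encode_cap_group(name, groups=CAP_GROUPS):
    """Look up the group's rank and thermometer-encode it."""
    return thermometer(groups.index(name), len(groups))


mid_cap = encode_cap_group("mid")
```

Unlike the one-of-n layout, neighbouring levels share most of their ones, so the encoding keeps the ordering of the groups.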

One more thing to keep in mind: people usually lose money trading stocks, so your data set may be biased. If you randomly choose only 10 traders, they may all be losers, and your data set will not be representative. To avoid this bias, you need a big enough data set of traders.

One more detail: you don't need to pass the user ID into the NN, because the NN would then learn the trading style of that particular user and use it for prediction.

Answer by g24l for Identifying machine learning data to make predictions

It seems to me that there are more dimensions than data points. However, your observations might lie in a linear subspace; you just need to compute the kernel of the matrix shown above.

If the kernel has a larger dimension than the number of data points, then you do not need to add more data points.
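As a rough sketch of that check, the dimension of the kernel can be obtained from the rank of the data matrix: it equals the number of columns minus the rank. The helper `matrix_rank` below is written for this example and uses exact fractions to avoid floating-point tolerance issues; the matrix is the question's sample data without the username column.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank via Gaussian elimination with exact rational arithmetic."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == len(m):
            break
    return rank


# Sample data from the question: 4 users (rows) by 8 features (columns).
X = [
    [234234, 212, 456789, 412, 234894, 42, 459289, 0],
    [234634, 24, 426789, 2, 234274, 3, 458189, 22],
    [239234, 12, 156489, 67, 271274, 782, 459120, 3],
    [234334, 32, 346789, 90, 234254, 2, 454919, 2],
]

rank = matrix_rank(X)
kernel_dim = len(X[0]) - rank  # rank-nullity theorem
```

With 4 rows and 8 columns the rank can be at most 4, so the kernel here is at least 4-dimensional.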

Now there is another thing to look at. You should check your classifier's VC dimension; you don't want to add too many points to the dataset. But anyway, that is mostly theoretical in this example, and I'm just joking.



Monday, February 15, 2016