== cornea ==

Digital asset storage and service over HTTP.

Each asset has a serviceId and an assetId identifier.

Each asset has multiple possible representations.

An asset to serve or store is uniquely identified by the combination of
(serviceId,assetId,rep).

For example, a picture of a person for a profile could be stored in the
"profile pictures" service, serviceId = 17.  Bob's profile picture could be
assetId = 314.  Each profile picture needs an original size (rep = 0)
and 3 derivative sizes (rep 1: 500x500, rep 2: 150x150, rep 3: 64x64).

So, four serviceable assets exist: (17,314,0) (17,314,1) (17,314,2) (17,314,3)

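The identifying triple can be modeled as a small value type. This is only a sketch in Python: the names AssetKey and serviceable_keys are made up for illustration and are not part of cornea.

```python
from typing import NamedTuple


class AssetKey(NamedTuple):
    """Hypothetical value type for the (serviceId, assetId, rep) triple."""
    service_id: int
    asset_id: int
    rep: int


# Representation numbers taken from the profile-picture example.
PROFILE_PICTURE_REPS = (0, 1, 2, 3)


def serviceable_keys(service_id, asset_id, reps=PROFILE_PICTURE_REPS):
    """List every independently stored asset behind one logical asset."""
    return [AssetKey(service_id, asset_id, rep) for rep in reps]


bob = serviceable_keys(17, 314)
for key in bob:
    print(key)
```

Because each key is a plain tuple, it can be used directly as a dictionary key or turned into a path-like name.
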
The storage system itself knows nothing of the different representations.
It does not know whether the asset being stored is a JPEG or an MPEG file.  It
also has no idea what 500x500 really means.

When assets are stored in the system, each is stored independently.

== Terms ==

RecallTable: database that tracks where assets live.
StorageNode: a box of disks.
  A StorageNode has the following states:
    open: online and accepting uploads (has available storage)
    closed: online and not accepting uploads (out of space)
    offline: unavailable
    decommissioned: unavailable, never to return to service

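The four StorageNode states map naturally onto an enumeration. The sketch below is illustrative only: NodeState, accepts_uploads and can_serve are hypothetical names, and the rule that a closed node can still serve reads is an assumption drawn from the word "online".

```python
from enum import Enum


class NodeState(Enum):
    OPEN = "open"                      # online, accepting uploads
    CLOSED = "closed"                  # online, out of space
    OFFLINE = "offline"                # unavailable
    DECOMMISSIONED = "decommissioned"  # unavailable, never returning


def accepts_uploads(state):
    """Only open nodes are upload destinations."""
    return state is NodeState.OPEN


def can_serve(state):
    # Assumption: any online node (open or closed) can still serve reads.
    return state in (NodeState.OPEN, NodeState.CLOSED)
```
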
== Upload ==

An asset is uploaded along with its serviceId, assetId and rep
(think of them as a path name).

STORE(input,serviceId,assetId,rep):

  // Look up the replication policy for serviceId,rep and the open nodes
  repinfo = r.repInfo(serviceId,rep)
  N = r.getOpenNodes()

  // get it onto one of the nodes
  S = ()
  for N as n
    if n.PUT(input,serviceId,assetId,rep)
      // note we stored, remove it from destinations and make it gold
      S.push(n)
      N.remove(n)
      gold = n
      break
    else
      N.remove(n)

  if |S| == 0 return failure

  // distancedNodes takes N and removes elements that are "too close" to those in S.
  // "too close" will be defined in repinfo.
  // Use the DC/Cage/Row/Rack/PDU concept.
  while |S| < repinfo.replicationCount
    T = repinfo.distancedNodes(N,S)
    if |T| == 0 break // can't find adequate nodes
    for T as n
      if n.PUT(gold,serviceId,assetId,rep)
        S.push(n)
        N.remove(n)
        break
      else
        T.remove(n)
    if |T| == 0 break // can't find adequate working nodes

  if |S| < repinfo.replicationCount          // not safe enough
     or                                      // or
     not r.insert(serviceId,assetId,rep,S)   // cannot record locations
    for S as n
      n.DELETE(serviceId,assetId,rep)  // no error checking
    return failure
  if |repinfo.dependents|
    if not q.enqueue(serviceId,assetId,rep)
      for S as n
        n.DELETE(serviceId,assetId,rep)  // no error checking
      return failure

  return success


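The STORE steps can be tried out against in-memory stand-ins. The following Python sketch is not cornea's implementation: Node, RepInfo and store are hypothetical names, "too close" is assumed to mean "same rack", replicas are written from the caller's bytes rather than copied node-to-node from gold, and the RecallTable and queue are a plain dict and list, so the insert and enqueue steps cannot fail here.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    """In-memory stand-in for a StorageNode (hypothetical)."""
    name: str
    rack: str
    healthy: bool = True
    blobs: dict = field(default_factory=dict)

    def put(self, data, key):
        if self.healthy:
            self.blobs[key] = data
        return self.healthy

    def delete(self, key):
        self.blobs.pop(key, None)


@dataclass
class RepInfo:
    replication_count: int
    dependents: tuple = ()

    def distanced_nodes(self, candidates, stored):
        # Assumption: "too close" means sharing a rack with an existing copy.
        used = {n.rack for n in stored}
        return [n for n in candidates if n.rack not in used]


def store(data, key, repinfo, open_nodes, recall_table, queue):
    """Sketch of STORE; returns True on success, False on failure."""
    candidates = list(open_nodes)
    stored = []

    # Get the first copy onto any open node.
    while candidates and not stored:
        node = candidates.pop(0)
        if node.put(data, key):
            stored.append(node)
    if not stored:
        return False

    # Replicate to nodes that are far enough from the copies made so far.
    while len(stored) < repinfo.replication_count:
        placed = False
        for node in repinfo.distanced_nodes(candidates, stored):
            candidates.remove(node)  # try each node at most once per call
            if node.put(data, key):
                stored.append(node)
                placed = True
                break
        if not placed:
            break  # no adequate working nodes left

    if len(stored) < repinfo.replication_count:
        for node in stored:  # not safe enough: roll back every copy
            node.delete(key)
        return False

    recall_table[key] = [n.name for n in stored]
    if repinfo.dependents:
        queue.append(key)  # derivatives are produced later by a worker
    return True


nodes = [Node("a", "rack1", healthy=False), Node("b", "rack1"),
         Node("c", "rack1"), Node("d", "rack2")]
recall, work = {}, []
ok = store(b"jpeg bytes", (17, 314, 0), RepInfo(2, (1, 2, 3)), nodes, recall, work)
print(ok, recall, work)
```

In this run node a rejects the upload, node b takes the first copy, and node d is chosen for the replica because c shares a rack with b.
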
WORKER:

  while (serviceId,assetId,repIn) = q.dequeue()
    repinfo = r.repInfo(serviceId,repIn)
    if |repinfo.dependents| == 0 continue // should never happen
    N = r.find(serviceId,assetId,repIn)
    for N as n
      if input = n.FETCH(serviceId,assetId,repIn)
        break
    if not input log error and continue

    for repinfo.dependents as repOut
      output = transform(serviceId,input,repIn,repOut)
      if not output log error and continue
      STORE(output,serviceId,assetId,repOut)

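The WORKER loop can be sketched the same way. Everything the loop depends on is passed in as a callable so the example stays self-contained: run_worker, FakeNode and the lambda stand-ins are hypothetical, and the toy transform just tags a string instead of resizing an image.

```python
from collections import deque


class FakeNode:
    """In-memory stand-in for a StorageNode that can FETCH (hypothetical)."""

    def __init__(self, blobs):
        self.blobs = blobs

    def fetch(self, key):
        return self.blobs.get(key)  # None when this node lacks the asset


def run_worker(queue, dependents_of, find_nodes, transform, store, log=print):
    """Drain the queue, deriving and storing each dependent representation."""
    while queue:
        service_id, asset_id, rep_in = queue.popleft()
        dependents = dependents_of(service_id, rep_in)
        if not dependents:
            continue  # should never happen

        # Take the first copy that any recorded node can return.
        data = None
        for node in find_nodes(service_id, asset_id, rep_in):
            data = node.fetch((service_id, asset_id, rep_in))
            if data is not None:
                break
        if data is None:
            log(f"no readable copy of {(service_id, asset_id, rep_in)}")
            continue

        for rep_out in dependents:
            output = transform(service_id, data, rep_in, rep_out)
            if output is None:
                log(f"transform {rep_in}->{rep_out} failed")
                continue
            store(output, service_id, asset_id, rep_out)


empty = FakeNode({})
full = FakeNode({(17, 314, 0): "original"})
derived = {}
pending = deque([(17, 314, 0)])

run_worker(
    pending,
    dependents_of=lambda sid, rep: [1, 2, 3] if (sid, rep) == (17, 0) else [],
    find_nodes=lambda sid, aid, rep: [empty, full],
    transform=lambda sid, data, rep_in, rep_out: f"{data}@{rep_out}",
    store=lambda out, sid, aid, rep: derived.__setitem__((sid, aid, rep), out),
)
print(derived)
```
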
MAINTENANCE:

  When a node is decommissioned, find all assets with that node in their set,
  copy those to a new random open node and replace the row in RecallTable.



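A rough Python sketch of the decommissioning pass follows. The function name evacuate, the dict-shaped RecallTable, and the copy callable are all assumptions made for illustration. It follows the "random open node" wording literally and does not apply any distance rule.

```python
import random


def evacuate(dead, recall_table, open_nodes, copy, rng=random):
    """Re-home every asset that lists `dead` among its locations."""
    for key, locations in list(recall_table.items()):
        if dead not in locations:
            continue
        survivors = [n for n in locations if n != dead]
        candidates = [n for n in open_nodes if n not in locations]
        if not survivors or not candidates:
            continue  # nothing to copy from, or nowhere to copy to
        target = rng.choice(candidates)
        # Only replace the row once the copy has actually succeeded.
        if copy(survivors[0], target, key):
            recall_table[key] = survivors + [target]


recall = {(17, 314, 0): ["a", "b"], (17, 314, 1): ["b", "c"]}
copies = []


def fake_copy(src, dst, key):
    copies.append((src, dst, key))
    return True


evacuate("a", recall, ["b", "c", "d"], fake_copy)
print(recall)
```

Rows that never referenced the decommissioned node are left untouched, and a row is rewritten only after its replacement copy is in place.
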
= Project Task List =

Configuration includes:
  * list of metadata nodes.

Storage node "setup".  This includes minimal OS install and base setup of software.
  * OpenSolaris
  * Apache 2.2
  * PostgreSQL libs
  * memcached
  * erlang / RabbitMQ
  * self assessment kit (health/heartbeat to metadata nodes)
  * scrubber
    * disk scrubbing

Metadata node "setup".  This includes a minimal PostgreSQL install w/ pgBouncer.
  * PostgreSQL 8.3+
    * schema RecallTable (parent)
      RecallTable_{hostid} child.
  * pgBouncer

Processing node "setup".  This is more freeform.  Use Gearman?